[00:00:13] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002"
[00:00:35] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002"
[00:00:35] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:00:36] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1029.eqiad.wmnet
[00:00:38] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[00:03:02] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:03:03] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1027.eqiad.wmnet
[00:03:35] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[00:04:31] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.267s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:05:56] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:05:57] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1028.eqiad.wmnet
[00:06:34] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1026.eqiad.wmnet
[00:09:31] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.17s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:10:57] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:55] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[00:13:19] <wikibugs>	 (PS1) Eevans: sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969)
[00:14:01] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[00:19:40] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002"
[00:20:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.412s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:23:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:25:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.12s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:25:46] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.042s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:30:31] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.392s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:33:59] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002"
[00:33:59] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:34:00] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1026.eqiad.wmnet
[00:34:24] <wikibugs>	 (PS1) Tim Starling: CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253)
[00:35:31] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.492s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:35:53] <wikibugs>	 (CR) Tim Starling: [C:+2] CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253) (owner: Tim Starling)
[00:38:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1122698
[00:38:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1122698 (owner: TrainBranchBot)
[00:43:16] <wikibugs>	 (Merged) jenkins-bot: CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253) (owner: Tim Starling)
[00:44:45] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]]
[00:44:49] <stashbot>	 T387253: Codemirror broken and doesn't recognize changes - https://phabricator.wikimedia.org/T387253
[00:45:40] <wikibugs>	 (PS2) Eevans: sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969)
[00:46:25] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[00:47:48] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:51:13] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[00:51:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.019s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:51:21] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:51:26] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1122698 (owner: TrainBranchBot)
[00:55:32] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[00:56:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:57:47] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]] (duration: 13m 02s)
[00:57:51] <stashbot>	 T387253: Codemirror broken and doesn't recognize changes - https://phabricator.wikimedia.org/T387253
[01:00:23] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:00:31] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405#10581265 (nshahquinn-wmf) Open→Resolved IDM has been up and running for a long time now, so unless I'm missing something, this is done.
[01:03:56] <wikibugs>	 (PS1) Bartosz Dziewoński: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748)
[01:04:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:05:06] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:05:06] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[01:05:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[01:05:35] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:53] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:29] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710
[01:08:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:09:07] <tzatziki>	 !log removing 4 files for legal compliance
[01:09:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:19] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:09:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:09:43] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:17:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:17:11] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[01:17:12] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:17:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[01:18:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:18:21] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:19:30] <tzatziki>	 !log removing 1 file for legal compliance
[01:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:46] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10581299 (Neobeta61) According to the MR Functional Spec import foreign drive happens 'at boot'.  But I used the restart command 'stor...
[01:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:30:02] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:50:28] <wikibugs>	 (PS1) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711
[01:54:39] <wikibugs>	 (CR) Bartosz Dziewoński: "Recently touched in:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[01:55:41] <wikibugs>	 (CR) Reedy: "recheck" [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:55:48] <wikibugs>	 (CR) Reedy: "resubmit" [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:56:31] <wikibugs>	 (CR) Reedy: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[02:00:03] <wikibugs>	 (PS2) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711
[02:12:59] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:31] <wikibugs>	 (PS1) Arlolra: Turn on Parsoid Read Views for XX wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[02:17:56] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[02:18:33] <wikibugs>	 (PS2) Arlolra: Turn on Parsoid Read Views for XX wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[02:32:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:33:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:47:35] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[02:47:44] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[02:56:43] <icinga-wm>	 RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2198) taken on 2025-02-26 01:44:59 (1176 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:59:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:03:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.163s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:03:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:04:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.146s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:12:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:17:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.059s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:19:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 826.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:24:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 939.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:32:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:37:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.079s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:54:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.08s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:59:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.237s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:59:45] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 977.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:04:30] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.221s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:07:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:13:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.329s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:40:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.324s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:50:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.097s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:53:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.371s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:03:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.081s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:03:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:04:17] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:04:17] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:09:45] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:12:15] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:12:15] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:17:41] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:22:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1195 db2153', diff saved to https://phabricator.wikimedia.org/P73602 and previous config saved to /var/cache/conftool/dbconfig/20250226-062234-root.json
[06:22:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2153.codfw.wmnet
[06:23:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1195.eqiad.wmnet
[06:24:33] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1122720 (https://phabricator.wikimedia.org/T387270)
[06:24:37] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1122721 (https://phabricator.wikimedia.org/T387270)
[06:25:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T387270
[06:25:28] <stashbot>	 T387270: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T387270
[06:25:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T387270', diff saved to https://phabricator.wikimedia.org/P73603 and previous config saved to /var/cache/conftool/dbconfig/20250226-062535-marostegui.json
[06:26:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T387270', diff saved to https://phabricator.wikimedia.org/P73604 and previous config saved to /var/cache/conftool/dbconfig/20250226-062617-marostegui.json
[06:26:44] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1122720 (https://phabricator.wikimedia.org/T387270) (owner: Gerrit maintenance bot)
[06:30:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2153.codfw.wmnet
[06:30:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1195.eqiad.wmnet
[06:30:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1195.eqiad.wmnet with reason: Index rebuild
[06:30:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Index rebuild
[06:38:05] <marostegui>	 !log Starting s7 codfw failover from db2218 to db2220 - T387270
[06:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:09] <stashbot>	 T387270: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T387270
[06:38:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 codfw as read-only for maintenance - T387270', diff saved to https://phabricator.wikimedia.org/P73605 and previous config saved to /var/cache/conftool/dbconfig/20250226-063817-root.json
[06:38:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary and set section read-write T387270', diff saved to https://phabricator.wikimedia.org/P73606 and previous config saved to /var/cache/conftool/dbconfig/20250226-063844-root.json
[06:39:30] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1122721 (https://phabricator.wikimedia.org/T387270) (owner: Gerrit maintenance bot)
[06:39:38] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:40:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T387270', diff saved to https://phabricator.wikimedia.org/P73607 and previous config saved to /var/cache/conftool/dbconfig/20250226-064009-marostegui.json
[06:41:38] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[06:42:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2218.codfw.wmnet
[06:44:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Index rebuild
[06:45:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet
[06:45:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2229.codfw.wmnet
[06:48:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2218.codfw.wmnet
[06:48:45] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Index rebuild
[06:50:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1037', diff saved to https://phabricator.wikimedia.org/P73608 and previous config saved to /var/cache/conftool/dbconfig/20250226-065054-marostegui.json
[06:51:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1037.eqiad.wmnet
[06:51:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2229.codfw.wmnet
[06:51:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet
[06:52:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Index rebuild
[06:52:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:52:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Index rebuild
[06:56:28] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[06:56:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73609 and previous config saved to /var/cache/conftool/dbconfig/20250226-065634-marostegui.json
[06:56:39] <stashbot>	 T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[06:57:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73610 and previous config saved to /var/cache/conftool/dbconfig/20250226-065742-marostegui.json
[06:58:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1037.eqiad.wmnet
[06:59:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1037.eqiad.wmnet with reason: Maintenance
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0700)
[07:01:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73611 and previous config saved to /var/cache/conftool/dbconfig/20250226-070841-root.json
[07:10:40] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db1253 [puppet] - https://gerrit.wikimedia.org/r/1122859
[07:12:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P73612 and previous config saved to /var/cache/conftool/dbconfig/20250226-071248-marostegui.json
[07:13:06] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db1253 [puppet] - https://gerrit.wikimedia.org/r/1122859 (owner: Marostegui)
[07:23:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73613 and previous config saved to /var/cache/conftool/dbconfig/20250226-072347-root.json
[07:24:53] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1122860 (https://phabricator.wikimedia.org/T387271)
[07:26:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1035 with weight 0 T387271', diff saved to https://phabricator.wikimedia.org/P73614 and previous config saved to /var/cache/conftool/dbconfig/20250226-072600-root.json
[07:26:04] <stashbot>	 T387271: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T387271
[07:26:05] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387271
[07:26:56] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1122860 (https://phabricator.wikimedia.org/T387271) (owner: Gerrit maintenance bot)
[07:27:26] <marostegui>	 !log Starting es7 eqiad failover from es1039 to es1035 - T387271
[07:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1035 to es7 primary and set section read-write T387271', diff saved to https://phabricator.wikimedia.org/P73615 and previous config saved to /var/cache/conftool/dbconfig/20250226-072751-root.json
[07:28:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P73616 and previous config saved to /var/cache/conftool/dbconfig/20250226-072802-marostegui.json
[07:28:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039 T387271', diff saved to https://phabricator.wikimedia.org/P73617 and previous config saved to /var/cache/conftool/dbconfig/20250226-072845-root.json
[07:29:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es1035', diff saved to https://phabricator.wikimedia.org/P73618 and previous config saved to /var/cache/conftool/dbconfig/20250226-072908-marostegui.json
[07:30:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1039.eqiad.wmnet
[07:36:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1039.eqiad.wmnet
[07:37:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[07:37:52] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10581646 (ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs
[07:38:23] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1039.eqiad.wmnet with reason: maintenance
[07:38:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73619 and previous config saved to /var/cache/conftool/dbconfig/20250226-073852-root.json
[07:39:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73620 and previous config saved to /var/cache/conftool/dbconfig/20250226-073901-root.json
[07:41:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[07:41:47] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] workaround for T256098 [debs/benthos] - https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: Fabfur)
[07:42:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[07:42:53] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1122894 (https://phabricator.wikimedia.org/T387273)
[07:43:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73621 and previous config saved to /var/cache/conftool/dbconfig/20250226-074309-marostegui.json
[07:43:13] <stashbot>	 T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[07:43:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[07:43:28] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti1024 to nftables [puppet] - https://gerrit.wikimedia.org/r/1122895
[07:43:41] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[07:43:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T385645)', diff saved to https://phabricator.wikimedia.org/P73622 and previous config saved to /var/cache/conftool/dbconfig/20250226-074347-marostegui.json
[07:44:23] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10581670 (ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs
[07:44:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T385645)', diff saved to https://phabricator.wikimedia.org/P73623 and previous config saved to /var/cache/conftool/dbconfig/20250226-074455-marostegui.json
[07:52:12] <wikibugs>	 (Abandoned) Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - https://gerrit.wikimedia.org/r/1004672 (owner: Slyngshede)
[07:52:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73624 and previous config saved to /var/cache/conftool/dbconfig/20250226-075251-root.json
[07:53:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1224 T385645', diff saved to https://phabricator.wikimedia.org/P73625 and previous config saved to /var/cache/conftool/dbconfig/20250226-075347-root.json
[07:53:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73626 and previous config saved to /var/cache/conftool/dbconfig/20250226-075357-root.json
[07:54:01] <stashbot>	 T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[07:54:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73627 and previous config saved to /var/cache/conftool/dbconfig/20250226-075406-root.json
[07:56:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73628 and previous config saved to /var/cache/conftool/dbconfig/20250226-075629-root.json
[08:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0800).
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:06:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73629 and previous config saved to /var/cache/conftool/dbconfig/20250226-080656-root.json
[08:07:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73630 and previous config saved to /var/cache/conftool/dbconfig/20250226-080757-root.json
[08:09:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73631 and previous config saved to /var/cache/conftool/dbconfig/20250226-080903-root.json
[08:09:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73632 and previous config saved to /var/cache/conftool/dbconfig/20250226-080911-root.json
[08:11:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73633 and previous config saved to /var/cache/conftool/dbconfig/20250226-081134-root.json
[08:22:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73634 and previous config saved to /var/cache/conftool/dbconfig/20250226-082201-root.json
[08:23:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73635 and previous config saved to /var/cache/conftool/dbconfig/20250226-082302-root.json
[08:23:14] <wikibugs>	 (CR) Brouberol: [C:+1] wdqs: Create DNS entry for one full graph host [dns] - https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[08:24:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73636 and previous config saved to /var/cache/conftool/dbconfig/20250226-082417-root.json
[08:26:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73637 and previous config saved to /var/cache/conftool/dbconfig/20250226-082640-root.json
[08:27:45] <wikibugs>	 (CR) Vgutierrez: hiera: Reimage lvs7002 as liberica LB [puppet] - https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[08:28:20] <wikibugs>	 (CR) Volans: Expose _gql_execute to wmf-netbox (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[08:30:48] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Reimage lvs7002 as liberica LB [puppet] - https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[08:31:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:32:55] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: give it more time to terminate [puppet] - https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) (owner: Hashar)
[08:33:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7002.magru.wmnet with OS bookworm
[08:34:21] <wikibugs>	 (CR) Volans: Netbox: fetch GQL queries from files (3 comments) [software/homer] - https://gerrit.wikimedia.org/r/1122133 (owner: Ayounsi)
[08:36:59] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:37:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73638 and previous config saved to /var/cache/conftool/dbconfig/20250226-083706-root.json
[08:37:44] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] sre.gitlab.upgrade: add a prompt before backups on replica [cookbooks] - https://gerrit.wikimedia.org/r/1122520 (owner: Jelto)
[08:38:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73639 and previous config saved to /var/cache/conftool/dbconfig/20250226-083807-root.json
[08:39:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73640 and previous config saved to /var/cache/conftool/dbconfig/20250226-083922-root.json
[08:39:35] <vgutierrez>	 BGP alert is me reimaging lvs7002
[08:41:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:41:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73641 and previous config saved to /var/cache/conftool/dbconfig/20250226-084145-root.json
[08:44:15] <wikibugs>	 (Merged) jenkins-bot: sre.gitlab.upgrade: add a prompt before backups on replica [cookbooks] - https://gerrit.wikimedia.org/r/1122520 (owner: Jelto)
[08:47:00] <wikibugs>	 (CR) Federico Ceratto: [C:+2] clone.py: Add helper functions for later use [cookbooks] - https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:47:04] <wikibugs>	 (CR) Elukey: "Yes definitely!" [deployment-charts] - https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: Hnowlan)
[08:47:50] <wikibugs>	 (CR) Federico Ceratto: [C:+2] clone.py: Cleanup, extract fqdn and hostname [cookbooks] - https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:48:57] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898
[08:48:59] <wikibugs>	 (PS1) Hashar: gerrit: remove explicit UseG1GC flag [puppet] - https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223)
[08:49:33] <wikibugs>	 (PS2) Elukey: kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493)
[08:51:24] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4983/console" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[08:51:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:51:42] <wikibugs>	 (CR) Jelto: "looks mostly good, two comments in-line" [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[08:52:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73642 and previous config saved to /var/cache/conftool/dbconfig/20250226-085212-root.json
[08:52:13] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4985/console" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[08:53:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73643 and previous config saved to /var/cache/conftool/dbconfig/20250226-085312-root.json
[08:55:03] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage
[08:55:39] <wikibugs>	 (CR) Jelto: wdqs: add routing for legacy full graph host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[08:56:20] <wikibugs>	 (PS2) Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898
[08:56:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73644 and previous config saved to /var/cache/conftool/dbconfig/20250226-085650-root.json
[08:57:15] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4987/co" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[08:58:37] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage
[08:59:28] <wikibugs>	 (PS3) Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898
[09:00:04] <jouncebot>	 dduvall and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0900).
[09:01:13] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[09:01:13] <wikibugs>	 (PS6) Volans: Fix CI reported issues [software/homer] - https://gerrit.wikimedia.org/r/1121370 (owner: Ayounsi)
[09:01:19] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[09:01:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Index rebuild
[09:01:38] <wikibugs>	 (CR) Volans: "@Arzhel, as agreed on IRC I took over the patch and fixed all the reported issues." [software/homer] - https://gerrit.wikimedia.org/r/1121370 (owner: Ayounsi)
[09:01:44] <wikibugs>	 (PS4) Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - https://gerrit.wikimedia.org/r/1122898
[09:02:30] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4991/co" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[09:03:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1180 db2169', diff saved to https://phabricator.wikimedia.org/P73645 and previous config saved to /var/cache/conftool/dbconfig/20250226-090323-root.json
[09:03:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1180.eqiad.wmnet
[09:04:38] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4992/co" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[09:06:57] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye
[09:07:06] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581814 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075...
[09:07:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73646 and previous config saved to /var/cache/conftool/dbconfig/20250226-090717-root.json
[09:07:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2169.codfw.wmnet
[09:08:34] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[09:08:48] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581816 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[09:09:01] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:51] <wikibugs>	 (CR) Ayounsi: Expose _gql_execute to wmf-netbox (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[09:10:02] <wikibugs>	 (PS1) Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276)
[09:10:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1180.eqiad.wmnet
[09:10:35] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4995/co" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[09:10:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Index rebuild
[09:11:46] <wikibugs>	 (CR) Volans: Expose _gql_execute to wmf-netbox (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[09:13:08] <wikibugs>	 (PS1) Jgiannelos: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1122902
[09:13:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2169.codfw.wmnet
[09:13:51] <wikibugs>	 (CR) Jgiannelos: "I added another round of rollouts with more wikis this time." [deployment-charts] - https://gerrit.wikimedia.org/r/1122902 (owner: Jgiannelos)
[09:14:03] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Index rebuild
[09:16:01] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:16:04] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye
[09:16:11] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581834 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075...
[09:18:59] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7002.magru.wmnet with OS bookworm
[09:19:18] <wikibugs>	 (PS2) Jgiannelos: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277)
[09:20:01] <wikibugs>	 (PS3) Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1094291
[09:20:04] <wikibugs>	 (PS2) Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276)
[09:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:23:55] <wikibugs>	 (PS4) Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1094291
[09:24:26] <wikibugs>	 (CR) Ayounsi: "Commit squashed in previous one. Replying to comments here." [software/homer] - https://gerrit.wikimedia.org/r/1122133 (owner: Ayounsi)
[09:26:00] <wikibugs>	 (CR) Ayounsi: [C:+1] "Thanks!" [software/homer] - https://gerrit.wikimedia.org/r/1121370 (owner: Ayounsi)
[09:26:23] <wikibugs>	 (CR) Volans: [C:+2] Fix CI reported issues [software/homer] - https://gerrit.wikimedia.org/r/1121370 (owner: Ayounsi)
[09:28:20] <wikibugs>	 (CR) CI reject: [V:-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[09:32:43] <wikibugs>	 (Merged) jenkins-bot: Fix CI reported issues [software/homer] - https://gerrit.wikimedia.org/r/1121370 (owner: Ayounsi)
[09:35:03] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[09:35:13] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[09:39:13] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye
[09:39:19] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581872 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075...
[09:41:42] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581888 (elukey) I tried multiple times to run reimage but the host doesn't PXE boot, not sure why, I tried to follow the console com2 as well but no clear error highlighted.
[09:44:46] <wikibugs>	 (PS2) Muehlenhoff: php: use component/pcre2 when using Php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: Hashar)
[09:47:47] <wikibugs>	 (PS1) Vgutierrez: hiera: restore lvs7002 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477)
[09:52:43] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[09:52:44] <hashar>	 !log Restarting Gerrit on gerrit2002 and gerrit1003
[09:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:40] <hashar>	 Feb 26 08:37:17 gerrit1003 systemd[1]: /lib/systemd/system/gerrit.service:16: Unknown key name 'TimeOutStopSec' in section 'Service', ignoring.
[09:53:41] <hashar>	 pff :/
[09:54:22] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10581920 (MoritzMuehlenhoff) >>! In T383723#10579629, @Andrew wrote: > @MoritzMuehlenhoff ping, is ganeti1044 ready to be moved?  Not yet, I'l...
[09:54:27] <hashar>	 anyway it has restarted
[09:56:30] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[09:57:01] <wikibugs>	 (PS1) Hashar: gerrit: fix systemd service TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754)
[09:58:13] <wikibugs>	 (CR) Hashar: gerrit: give it more time to terminate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) (owner: Hashar)
[09:58:31] <jelto>	 hashar: I'm already preparing a fix to change the timeout setting
[09:58:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754) (owner: Hashar)
[09:59:09] <wikibugs>	 (CR) Jelto: [C:+2] "ah, you are faster, this looks good" [puppet] - https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754) (owner: Hashar)
[09:59:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73647 and previous config saved to /var/cache/conftool/dbconfig/20250226-095917-root.json
[10:00:16] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: restore lvs7002 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[10:01:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73648 and previous config saved to /var/cache/conftool/dbconfig/20250226-100148-root.json
[10:02:26] <wikibugs>	 (CR) Vgutierrez: hiera: Reimage lvs7001 as liberica LB [puppet] - https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[10:02:34] <wikibugs>	 (PS2) Vgutierrez: hiera: Reimage lvs7001 as liberica LB [puppet] - https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477)
[10:05:25] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[10:08:55] <vgutierrez>	 !log depooling lvs7001 before reimaging - T384477
[10:08:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:59] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[10:09:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [debs/benthos] - https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: Fabfur)
[10:09:30] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Reimage lvs7001 as liberica LB [puppet] - https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[10:11:15] <wikibugs>	 (CR) Hashar: php: use component/pcre2 when using Php 8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: Hashar)
[10:11:26] <wikibugs>	 (PS3) Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276)
[10:11:35] <icinga-wm>	 PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[10:11:43] <fabfur>	 thanks moritzm, I was just contacting you for a suggestion on that! :D
[10:11:59] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:11] <vgutierrez>	 ^^ that's lvs7001 depooled
[10:12:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[10:12:58] <moritzm>	 :-)
[10:13:19] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs7001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[10:13:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:13:47] <wikibugs>	 (CR) Fabfur: [C:+2] workaround for T256098 [debs/benthos] - https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: Fabfur)
[10:13:51] <wikibugs>	 (CR) Fabfur: [V:+2 C:+2] workaround for T256098 [debs/benthos] - https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: Fabfur)
[10:14:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73649 and previous config saved to /var/cache/conftool/dbconfig/20250226-101401-root.json
[10:14:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73650 and previous config saved to /var/cache/conftool/dbconfig/20250226-101422-root.json
[10:15:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:16:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73651 and previous config saved to /var/cache/conftool/dbconfig/20250226-101654-root.json
[10:17:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10581998 (cmooney) >>! In T384731#10579181, @ayounsi wrote: >>> And what happens if peer_descr is missing or empt...
[10:18:08] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7001.magru.wmnet with OS bookworm
[10:24:32] <wikibugs>	 (CR) Hnowlan: [C:+1] pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[10:28:27] <wikibugs>	 (CR) AikoChou: [C:+1] kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[10:29:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73652 and previous config saved to /var/cache/conftool/dbconfig/20250226-102906-root.json
[10:29:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73653 and previous config saved to /var/cache/conftool/dbconfig/20250226-102927-root.json
[10:32:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73654 and previous config saved to /var/cache/conftool/dbconfig/20250226-103159-root.json
[10:36:12] <wikibugs>	 (CR) Clément Goubert: mwscript: do not run mesh checks when running in a loop (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[10:36:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Index rebuild
[10:39:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage
[10:39:32] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Index rebuild
[10:42:21] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage
[10:42:35] <wikibugs>	 (PS1) Fabfur: benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332)
[10:43:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:44:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73655 and previous config saved to /var/cache/conftool/dbconfig/20250226-104359-root.json
[10:44:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73656 and previous config saved to /var/cache/conftool/dbconfig/20250226-104411-root.json
[10:44:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73658 and previous config saved to /var/cache/conftool/dbconfig/20250226-104433-root.json
[10:46:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73659 and previous config saved to /var/cache/conftool/dbconfig/20250226-104619-root.json
[10:46:25] <wikibugs>	 (CR) Vgutierrez: [C:+1] benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[10:47:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73660 and previous config saved to /var/cache/conftool/dbconfig/20250226-104704-root.json
[10:48:26] <wikibugs>	 (CR) Stevemunene: [C:+1] airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[10:49:01] <wikibugs>	 (CR) Stevemunene: [C:+1] airflow-analytics-product: disable and remove the airflow systemd services [puppet] - https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[10:50:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:51:49] <wikibugs>	 (CR) Fabfur: [C:+2] benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[10:52:05] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:52:25] <wikibugs>	 (PS1) Gkyziridis: inference-services: deployment for edit-check dummy model. [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[10:52:41] <wikibugs>	 (PS1) Vgutierrez: hiera: Restore lvs7001 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477)
[10:53:35] <wikibugs>	 (CR) CI reject: [V:-1] inference-services: deployment for edit-check dummy model. [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[10:54:15] <wikibugs>	 (CR) Volans: "The other CR might be abandoned at this point I guess" [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[10:55:50] <wikibugs>	 SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10582102 (JMeybohm) I stumbled upon this again recently and I think the current configuration does not allow pod creation at...
[10:59:05] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:59:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73661 and previous config saved to /var/cache/conftool/dbconfig/20250226-105905-root.json
[10:59:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73662 and previous config saved to /var/cache/conftool/dbconfig/20250226-105916-root.json
[10:59:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73663 and previous config saved to /var/cache/conftool/dbconfig/20250226-105937-root.json
[10:59:53] <marostegui>	 !log Drop schema change on s3 codfw master with replication dbmaint T385645
[10:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:56] <stashbot>	 T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1100)
[11:01:05] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:01:17] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582145 (ayounsi) I forked the discussion to {T387287} and {T387288} as that task was becoming more difficult to...
[11:01:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73664 and previous config saved to /var/cache/conftool/dbconfig/20250226-110124-root.json
[11:01:37] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582148 (ayounsi)
[11:02:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73665 and previous config saved to /var/cache/conftool/dbconfig/20250226-110209-root.json
[11:03:37] <marostegui>	 !log Drop schema change on s7 codfw master with replication dbmaint T385645
[11:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:14] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7001.magru.wmnet with OS bookworm
[11:04:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Schema change
[11:14:05] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:14:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73666 and previous config saved to /var/cache/conftool/dbconfig/20250226-111410-root.json
[11:14:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73667 and previous config saved to /var/cache/conftool/dbconfig/20250226-111421-root.json
[11:15:41] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Restore lvs7001 BGP priority [puppet] - https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:16:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73668 and previous config saved to /var/cache/conftool/dbconfig/20250226-111629-root.json
[11:20:12] <jinxer-wm>	 FIRING: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:25] <wikibugs>	 (CR) Slyngshede: [C:-1] "Change of plans, sorry. This will be rolled into https://phabricator.wikimedia.org/T341581." [puppet] - https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: Slyngshede)
[11:20:28] <vgutierrez>	 !incidents
[11:20:28] <sirenbot>	 5699 (UNACKED)  ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad)
[11:20:31] <vgutierrez>	 !ack 5699
[11:20:31] <sirenbot>	 5699 (ACKED)  ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad)
[11:20:50] <vgutierrez>	 I'm guessing that's unexpected? :)
[11:21:07] <jayme>	 nothing obvious in the above lines here...so maaaybe :)
[11:21:46] <vgutierrez>	 !log repooling lvs7001 running liberica - T384477
[11:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:50] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[11:22:07] <jayme>	 but it's reachable via ipv4 at least
[11:25:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:25:30] <vgutierrez>	 and back to normal
[11:26:37] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:26:43] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:29:10] <wikibugs>	 (PS1) Effie Mouzeli: shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038)
[11:29:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73669 and previous config saved to /var/cache/conftool/dbconfig/20250226-112915-root.json
[11:31:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73670 and previous config saved to /var/cache/conftool/dbconfig/20250226-113134-root.json
[11:32:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync
[11:32:27] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[11:33:43] <wikibugs>	 (CR) Gkyziridis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:33:45] <wikibugs>	 (CR) Gkyziridis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:34:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1168 db2180 T386242', diff saved to https://phabricator.wikimedia.org/P73671 and previous config saved to /var/cache/conftool/dbconfig/20250226-113453-root.json
[11:34:58] <stashbot>	 T386242: Upgrade and rebuild s6 - https://phabricator.wikimedia.org/T386242
[11:36:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db2169 with 100%', diff saved to https://phabricator.wikimedia.org/P73673 and previous config saved to /var/cache/conftool/dbconfig/20250226-113613-marostegui.json
[11:37:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2180.codfw.wmnet
[11:37:14] <wikibugs>	 (CR) Clément Goubert: [C:+1] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: Effie Mouzeli)
[11:37:16] <wikibugs>	 (CR) Hnowlan: [C:+1] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: Effie Mouzeli)
[11:37:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1168.eqiad.wmnet
[11:37:59] <wikibugs>	 (PS2) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder. [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[11:39:03] <wikibugs>	 (CR) CI reject: [V:-1] inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder. [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:39:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1206 db2170 T385561', diff saved to https://phabricator.wikimedia.org/P73674 and previous config saved to /var/cache/conftool/dbconfig/20250226-113935-root.json
[11:39:40] <stashbot>	 T385561: Upgrade and rebuild s1 - https://phabricator.wikimedia.org/T385561
[11:39:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2170.codfw.wmnet
[11:39:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1206.eqiad.wmnet
[11:41:43] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: Effie Mouzeli)
[11:42:23] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10582448 (Vgutierrez) p:Triage→Medium
[11:42:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[11:42:53] <wikibugs>	 (Merged) jenkins-bot: shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: Effie Mouzeli)
[11:43:28] <wikibugs>	 (PS3) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[11:43:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2180.codfw.wmnet
[11:44:23] <wikibugs>	 (CR) CI reject: [V:-1] inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:45:01] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[11:45:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1168.eqiad.wmnet
[11:45:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Index rebuild
[11:45:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[11:45:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Index rebuild
[11:45:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2170.codfw.wmnet
[11:46:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Index rebuild
[11:46:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73675 and previous config saved to /var/cache/conftool/dbconfig/20250226-114640-root.json
[11:48:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1206.eqiad.wmnet
[11:49:32] <vgutierrez>	 !log uploaded gobgpd 3.33 to apt.wm.o (bookworm-wikimedia) - T386687
[11:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:36] <stashbot>	 T386687: backport gobgp 3.33 from trixie - https://phabricator.wikimedia.org/T386687
[11:49:37] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Index rebuild
[11:49:46] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[11:50:47] <wikibugs>	 (PS4) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[11:51:43] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[11:52:20] <wikibugs>	 (CR) CI reject: [V:-1] inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:52:43] <wikibugs>	 (PS1) Vgutierrez: cumin: Remove lvs-magru alias [puppet] - https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477)
[11:53:27] <wikibugs>	 (CR) Klausman: [C:+1] "So, to summarize: this change is basically a no-op in prod, and when the patched version of kserve goes to prod, we get what the original " [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[11:53:28] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, do we need equivalent aliases for liberica?" [puppet] - https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:54:13] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1024.eqiad.wmnet with reason: remove from cluster for reimage
[11:54:20] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10582490 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dce06e0b-27de-4e76-8cf6-d4947764ef79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[11:54:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[11:55:10] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10582493 (ops-monitoring-bot) Draining ganeti1044.eqiad.wmnet of running VMs
[11:55:26] <wikibugs>	 (CR) Vgutierrez: [C:+2] cumin: Remove lvs-magru alias [puppet] - https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:56:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[11:59:54] <wikibugs>	 (PS5) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step  - Copy hooks from readability model [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[11:59:56] <wikibugs>	 (CR) Elukey: "This change removes the extra security context from isvcs, so it should regenerate them etc.. but practically yes, these extra bits are ha" [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[12:00:05] <jouncebot>	 mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1200).
[12:01:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Index rebuild
[12:02:31] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[12:02:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[12:03:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:08] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1122668 (owner: PipelineBot)
[12:04:08] <wikibugs>	 (CR) Klausman: [C:+1] "Ack, ty!" [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[12:05:18] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1122668 (owner: PipelineBot)
[12:06:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[12:06:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[12:06:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1037 with weight 0 T387273', diff saved to https://phabricator.wikimedia.org/P73676 and previous config saved to /var/cache/conftool/dbconfig/20250226-120649-root.json
[12:06:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T387273
[12:06:53] <stashbot>	 T387273: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T387273
[12:07:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[12:07:13] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1122894 (https://phabricator.wikimedia.org/T387273) (owner: Gerrit maintenance bot)
[12:07:19] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10582517 (ops-monitoring-bot) Draining ganeti1044.eqiad.wmnet of running VMs
[12:07:44] <marostegui>	 !log Starting es6 eqiad failover from es1038 to es1037 - T387273
[12:07:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1024.eqiad.wmnet
[12:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1037 to es6 primary and set section read-write T387273', diff saved to https://phabricator.wikimedia.org/P73677 and previous config saved to /var/cache/conftool/dbconfig/20250226-120806-root.json
[12:08:21] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:08:44] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:08:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1038 T387273', diff saved to https://phabricator.wikimedia.org/P73678 and previous config saved to /var/cache/conftool/dbconfig/20250226-120848-root.json
[12:09:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add weight to es1037', diff saved to https://phabricator.wikimedia.org/P73679 and previous config saved to /var/cache/conftool/dbconfig/20250226-120925-root.json
[12:10:03] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:10:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1038.eqiad.wmnet
[12:10:32] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:10:44] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[12:13:48] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:14:17] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:19:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1038.eqiad.wmnet
[12:20:45] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1122927
[12:23:18] <wikibugs>	 (CR) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the f (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:24:15] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1038.eqiad.wmnet with reason: Index rebuild
[12:27:31] <wikibugs>	 (PS1) Muehlenhoff: Blacklist hfs/hfsplus [puppet] - https://gerrit.wikimedia.org/r/1122929
[12:29:07] <wikibugs>	 (Abandoned) Ayounsi: Netbox: fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1122133 (owner: Ayounsi)
[12:33:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:33:53] <wikibugs>	 (CR) Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files (4 comments) [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[12:34:05] <wikibugs>	 (PS5) Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1094291
[12:34:51] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3643 MB (3% inode=98%): /tmp 3643 MB (3% inode=98%): /var/tmp 3643 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[12:36:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[12:36:19] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10582628 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[12:38:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73680 and previous config saved to /var/cache/conftool/dbconfig/20250226-123800-root.json
[12:39:00] <wikibugs>	 (CR) Kevin Bazira: "thank you for working on this, George. usually, with model-servers that are still in the experimental phase (like this edit-check), we dep" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:40:11] <wikibugs>	 (CR) CI reject: [V:-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[12:44:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73681 and previous config saved to /var/cache/conftool/dbconfig/20250226-124433-root.json
[12:46:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73682 and previous config saved to /var/cache/conftool/dbconfig/20250226-124602-root.json
[12:46:21] <wikibugs>	 (PS1) Hnowlan: mobileapps: scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749)
[12:53:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73683 and previous config saved to /var/cache/conftool/dbconfig/20250226-125305-root.json
[12:57:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[12:59:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73684 and previous config saved to /var/cache/conftool/dbconfig/20250226-125938-root.json
[13:01:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73685 and previous config saved to /var/cache/conftool/dbconfig/20250226-130107-root.json
[13:03:29] <wikibugs>	 (PS6) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step  - Copy hooks from readability model  - Add edit-check under /experimental/values-ml-staging-codfw.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[13:03:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:07:27] <wikibugs>	 (CR) Gkyziridis: "Thnx for reviewing this patch Kevin. I am not sure if I understood completely your comment, should I remove the folder '/helm.d/ml-service" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:07:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[13:07:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1024.eqiad.wmnet
[13:08:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73686 and previous config saved to /var/cache/conftool/dbconfig/20250226-130810-root.json
[13:08:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:14:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73687 and previous config saved to /var/cache/conftool/dbconfig/20250226-131443-root.json
[13:16:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73688 and previous config saved to /var/cache/conftool/dbconfig/20250226-131612-root.json
[13:19:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:19:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[13:19:55] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10582738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[13:23:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73689 and previous config saved to /var/cache/conftool/dbconfig/20250226-132315-root.json
[13:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:24:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:24:14] <vgutierrez>	 !log testing gobgp 3.33 in lvs1013 - T386687
[13:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:18] <stashbot>	 T386687: backport gobgp 3.33 from trixie - https://phabricator.wikimedia.org/T386687
[13:25:16] <wikibugs>	 (CR) Kevin Bazira: "yes, for now the `helmfile.d/ml-services/edit-check/*` config doesn't have to be deployed since you'll likely end up using the `revision-m" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:26:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti1024 to nftables [puppet] - https://gerrit.wikimedia.org/r/1122895 (owner: Muehlenhoff)
[13:29:31] <icinga-wm>	 PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 14884MiB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[13:29:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73690 and previous config saved to /var/cache/conftool/dbconfig/20250226-132948-root.json
[13:31:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73691 and previous config saved to /var/cache/conftool/dbconfig/20250226-133118-root.json
[13:33:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:34:51] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3554 MB (3% inode=98%): /tmp 3554 MB (3% inode=98%): /var/tmp 3554 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[13:36:03] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:36:32] <wikibugs>	 (CR) Muehlenhoff: "Few things inline, otherwise LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[13:36:50] <wikibugs>	 (CR) Ladsgroup: [C:-1] "You probably can just adjust this patch to allow them only. Does that sound good?" [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: Simon04)
[13:36:52] <wikibugs>	 (PS1) Brouberol: airflow-analytics-product: add missing database value [deployment-charts] - https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623)
[13:38:04] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 29263
[13:38:08] <wikibugs>	 (CR) Jennifer Ebe: [C:+1] airflow-analytics-product: add missing database value [deployment-charts] - https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[13:38:15] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-analytics-product: add missing database value [deployment-charts] - https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[13:38:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73692 and previous config saved to /var/cache/conftool/dbconfig/20250226-133820-root.json
[13:38:43] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29263
[13:38:52] <wikibugs>	 (PS1) Marostegui: valid_sections.pp: Add ms1, ms2, and ms3 [puppet] - https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332)
[13:39:12] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:39:39] <wikibugs>	 (CR) Ladsgroup: [C:+1] "<3 <3 <3" [puppet] - https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[13:40:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:40:10] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:41:07] <wikibugs>	 (PS1) Ladsgroup: Set commons categorylinks migration to WRITE_BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164)
[13:41:07] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[13:44:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73693 and previous config saved to /var/cache/conftool/dbconfig/20250226-134453-root.json
[13:45:04] <wikibugs>	 (CR) Jgiannelos: [C:+1] mobileapps: scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: Hnowlan)
[13:45:27] <wikibugs>	 (PS1) Brouberol: airflow-analytics-product: remove import mode overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1122948
[13:45:55] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1122898 (owner: Slyngshede)
[13:46:13] <wikibugs>	 (CR) Jennifer Ebe: [C:+1] airflow-analytics-product: remove import mode overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1122948 (owner: Brouberol)
[13:46:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73694 and previous config saved to /var/cache/conftool/dbconfig/20250226-134623-root.json
[13:47:22] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-analytics-product: remove import mode overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1122948 (owner: Brouberol)
[13:49:31] <icinga-wm>	 RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[13:52:17] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:52:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[13:54:54] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] airflow-analytics-product: disable and remove the airflow systemd services [puppet] - https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) (owner: Brouberol)
[13:57:37] <wikibugs>	 (PS1) Slyngshede: Add option to delete a single signup [software/bitu] - https://gerrit.wikimedia.org/r/1122951
[13:57:45] <Amir1>	 !log dropped vote_log and arbcom1_vote tables on English Wikipedia (T376627)
[13:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:48] <stashbot>	 T376627: Drop ad-hoc or obsolete tables in production - https://phabricator.wikimedia.org/T376627
[13:59:08] <moritzm>	 !log installing tiff security updates
[13:59:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:28] <Amir1>	 ooh nice
[14:01:00] <Amir1>	 Lucas_WMDE: would you merge core patches on master as part of the deployment window?
[14:01:09] <wikibugs>	 (CR) Ladsgroup: [C:+2] Set commons categorylinks migration to WRITE_BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164) (owner: Ladsgroup)
[14:01:26] <Lucas_WMDE>	 *confused*
[14:01:32] <Lucas_WMDE>	 anyway, I can’t deploy today, sorry
[14:01:35] <Amir1>	 😈
[14:01:49] <Amir1>	 Deploy I can take care of myself :D
[14:02:02] <wikibugs>	 (Merged) jenkins-bot: Set commons categorylinks migration to WRITE_BOTH [mediawiki-config] - https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164) (owner: Ladsgroup)
[14:03:08] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]]
[14:03:12] <stashbot>	 T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164
[14:03:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:05:10] <wikibugs>	 (CR) Elukey: [C:+1] Blacklist hfs/hfsplus [puppet] - https://gerrit.wikimedia.org/r/1122929 (owner: Muehlenhoff)
[14:06:22] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:06:52] <wikibugs>	 (PS4) Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947)
[14:07:28] <wikibugs>	 (PS5) Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947)
[14:08:10] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[14:09:09] <wikibugs>	 (PS6) Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947)
[14:10:05] <wikibugs>	 (CR) Slyngshede: Ensure that the LDAP user is parsed as an Entry object. (5 comments) [software/bitu] - https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: Slyngshede)
[14:10:47] <wikibugs>	 (CR) Elukey: aux-k8s-ctrl codfw: apply role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: Herron)
[14:11:37] <wikibugs>	 (CR) Giuseppe Lavagetto: mediawiki: introduce feature flags (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[14:12:27] <wikibugs>	 (CR) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[14:13:01] <wikibugs>	 (CR) Muehlenhoff: "Three comments inline" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[14:14:13] <wikibugs>	 (PS3) Muehlenhoff: php: use component/pcre2 when using Php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: Hashar)
[14:14:13] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: Hashar)
[14:14:39] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]] (duration: 11m 31s)
[14:14:40] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Observability-Logging, Patch-For-Review: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10583004 (colewhite)
[14:14:43] <stashbot>	 T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164
[14:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3636 MB (3% inode=98%): /tmp 3636 MB (3% inode=98%): /var/tmp 3636 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[14:14:55] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Observability-Logging, Patch-For-Review: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10583009 (colewhite) a:colewhite→None
[14:15:44] <wikibugs>	 (CR) Elukey: "For the etcd k8s cluster it is unclear to me if we need a backup, we can probably raise the question to the kubernetes SIG and decide a st" [puppet] - https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: Herron)
[14:16:20] <wikibugs>	 (CR) Elukey: [C:+2] kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[14:18:43] <wikibugs>	 (CR) Cwhite: [C:+2] site: clean up logstash102[6789] configs [puppet] - https://gerrit.wikimedia.org/r/1122691 (https://phabricator.wikimedia.org/T383287) (owner: Cwhite)
[14:19:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:20:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:21:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[14:21:12] <icinga-wm>	 PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:22:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:22:10] <wikibugs>	 (CR) Jforrester: Deduplicate JsonConfig config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[14:22:35] <wikibugs>	 (CR) Jforrester: [C:+1] Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - https://gerrit.wikimedia.org/r/1122683 (owner: Bartosz Dziewoński)
[14:23:12] <icinga-wm>	 RECOVERY - BFD status on cr1-magru is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:24:03] <wikibugs>	 (PS1) Ayounsi: Add exporter port to gNMI metrics instance label [puppet] - https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287)
[14:24:52] <wikibugs>	 (CR) Jforrester: [C:+1] Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: Bartosz Dziewoński)
[14:28:49] <wikibugs>	 (PS7) Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step  - Copy hooks from readability model  - Add edit-check under /experimental/values-ml-staging-codfw.yaml  - Remove edit-check folder deploy it only under experimental [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabric
[14:30:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[14:30:13] <wikibugs>	 (03PS1) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287)
[14:30:58] <wikibugs>	 (03PS1) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687)
[14:30:59] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687)
[14:31:14] <wikibugs>	 (03CR) 10Gkyziridis: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[14:31:44] <wikibugs>	 (03PS2) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687)
[14:31:44] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687)
[14:32:03] <wikibugs>	 (03CR) 10Bking: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper)
[14:32:24] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[14:33:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:34:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[14:35:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[14:35:34] <wikibugs>	 (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the f (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[14:36:05] <wikibugs>	 (03PS3) 10Bking: wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper)
[14:36:05] <wikibugs>	 (03CR) 10Ayounsi: "Yep, making sure I get the +1 from Cathal before deploying. Then feel free to deploy when I'm away." [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:36:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1014.eqiad.wmnet with reason: host reimage
[14:36:25] <wikibugs>	 (03CR) 10Bking: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper)
[14:36:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:29] <wikibugs>	 (03PS3) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687)
[14:39:30] <wikibugs>	 (03PS3) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687)
[14:39:35] <wikibugs>	 (03CR) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:39:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1014.eqiad.wmnet with reason: host reimage
[14:40:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: java updates - bking@cumin2002 - T377938
[14:41:33] <wikibugs>	 (03PS13) 10Brouberol: global_config: add external services for opensearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/1122900 (https://phabricator.wikimedia.org/T380752)
[14:42:22] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[14:44:05] <wikibugs>	 (03PS2) 10Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007)
[14:45:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73695 and previous config saved to /var/cache/conftool/dbconfig/20250226-144555-root.json
[14:48:18] <wikibugs>	 (03PS2) 10Hnowlan: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749)
[14:48:53] <wikibugs>	 (03PS1) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214)
[14:49:09] <wikibugs>	 (03PS3) 10Hnowlan: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749)
[14:49:15] <tgr|away>	 Amir1: are you still deploying? I have a last-minute addition to the window
[14:51:23] <tgr|away>	 (not logged in on the deploy host so I guess not)
[14:52:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[14:53:32] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972)
[14:53:33] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977)
[14:53:53] <Amir1>	 tgr|away: yeah, I went for lunch
[14:55:14] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[14:55:40] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1120968|CentralAuth: Enable SUL3 signup on group 0 (T384007)]]
[14:55:44] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007
[14:57:05] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan)
[14:57:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan)
[14:58:25] <wikibugs>	 (03CR) 10Kevin Bazira: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the f (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[14:58:36] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1120968|CentralAuth: Enable SUL3 signup on group 0 (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:58:45] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan)
[14:59:20] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans)
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1500)
[15:00:06] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583340 (10Jhancock.wm) @MatthewVernon how are the two OS drives looking now?
[15:01:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73696 and previous config saved to /var/cache/conftool/dbconfig/20250226-150100-root.json
[15:01:22] <James_F>	 tgr|away: Deploy complete? OK if I start the Wikifunctions service deploy?
[15:01:28] <wikibugs>	 (03PS1) 10Sergio Gimeno: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979)
[15:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:02:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:03:10] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004.codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[15:03:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:04:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:04:26] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[15:04:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:04:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:04:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1014.eqiad.wmnet with OS bookworm
[15:05:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[15:05:19] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583372 (10Jclark-ctr) END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jc...
[15:05:31] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583373 (10Jclark-ctr)
[15:06:01] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583375 (10Jclark-ctr) 05Open→03Resolved a:05jcrespo→03Jclark-ctr
[15:06:22] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Antoine!" [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar)
[15:06:42] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979) (owner: 10Sergio Gimeno)
[15:06:57] <tgr|away>	 James_F: just a sec, I'll roll back
[15:07:02] <logmsgbot>	 !log tgr@deploy2002 Sync cancelled.
[15:07:02] <James_F>	 No worries.
[15:07:14] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[15:07:39] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "CentralAuth: Enable SUL3 signup on group 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969
[15:07:39] <wikibugs>	 (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I9e8451d22cd2d975e55ddba83ed06a7e98c15398" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[15:08:07] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979) (owner: 10Sergio Gimeno)
[15:08:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969 (owner: 10TrainBranchBot)
[15:09:02] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "CentralAuth: Enable SUL3 signup on group 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969 (owner: 10TrainBranchBot)
[15:09:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: java updates - bking@cumin2002 - T377938
[15:09:31] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004.codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[15:09:33] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]]
[15:09:58] <wikibugs>	 10SRE-swift-storage, 06Commons: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10583394 (10A_smart_kitten) Adding to the #sre-swift-storage queue for triage
[15:10:32] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1004.eqiad.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002
[15:12:33] <logmsgbot>	 !log tgr@deploy2002 tgr, trainbranchbot: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:12:43] <logmsgbot>	 !log tgr@deploy2002 tgr, trainbranchbot: Continuing with sync
[15:13:28] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[15:15:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[15:15:44] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[15:16:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73697 and previous config saved to /var/cache/conftool/dbconfig/20250226-151606-root.json
[15:16:29] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1004.eqiad.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002
[15:16:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73698 and previous config saved to /var/cache/conftool/dbconfig/20250226-151641-root.json
[15:19:01] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]] (duration: 09m 28s)
[15:19:31] <wikibugs>	 (03PS1) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576)
[15:19:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[15:20:48] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[15:20:57] <wikibugs>	 (03PS1) 10Elukey: Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974
[15:21:03] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972) (owner: 10Jforrester)
[15:21:08] <wikibugs>	 (03PS2) 10Elukey: Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974
[15:21:41] <tgr|away>	 !log UTC afternoon deploys done
[15:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:45] <tgr|away>	 sorry for the delay
[15:22:08] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[15:22:11] <James_F>	 No worries, it happens. Much better to test and revert than leave it broken to meet the time window deadline.
[15:22:17] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[15:22:21] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972) (owner: 10Jforrester)
[15:22:42] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[15:22:50] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[15:23:54] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:24:32] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:24:46] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[15:25:04] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:25:51] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:25:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:26:23] <wikibugs>	 (03PS2) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576)
[15:26:31] <wikibugs>	 (03CR) 10Klausman: [C:03+1] Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 (owner: 10Elukey)
[15:26:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:26:42] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:27:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:27:35] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 (owner: 10Elukey)
[15:27:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez)
[15:28:22] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[15:28:27] <moritzm>	 !log depooled maps2009 for server move T383709
[15:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:31] <stashbot>	 T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709
[15:28:35] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester)
[15:28:49] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[15:29:03] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[15:29:38] <logmsgbot>	 !log sgimeno@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[15:29:46] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10583465 (10MoritzMuehlenhoff) @VRiley-WMF ganeti1044 is drained, you can move it around.
[15:29:50] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester)
[15:30:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:30:36] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:31:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73699 and previous config saved to /var/cache/conftool/dbconfig/20250226-153111-root.json
[15:31:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387257#10583481 (10Jhancock.wm) result of testing with luca. leaving open until March 7th to catch any other errors. disregard.
[15:31:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73700 and previous config saved to /var/cache/conftool/dbconfig/20250226-153146-root.json
[15:32:32] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:33:39] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[15:34:08] <wikibugs>	 (03PS2) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214)
[15:34:19] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:34:21] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:34:37] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122976
[15:35:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:37:09] <wikibugs>	 (03PS3) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214)
[15:39:17] <wikibugs>	 (03PS8) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step  - Copy hooks from readability model  - Add edit-check under /experimental/values-ml-staging-codfw.yaml  - Remove edit-check folder deploy it only under experimental  - Add MODEL_NAME at edit-check custom_env [deployment-charts] - 10https://gerri
[15:39:18] <wikibugs>	 (https://phabricator.wikimedia.org/T386100)
[15:46:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73701 and previous config saved to /var/cache/conftool/dbconfig/20250226-154616-root.json
[15:46:36] <wikibugs>	 (03PS7) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb)
[15:46:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73702 and previous config saved to /var/cache/conftool/dbconfig/20250226-154651-root.json
[15:47:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583538 (10cmooney) @fgiunchedi I wonder if you might have any ideas on this.  Our routers and our switches are exporting timestamps with different number of digits: ` gn...
[15:47:47] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583539 (10Jhancock.wm) @elukey try now. it got disabled on the nic.
[15:48:20] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356 (10RobH) 03NEW
[15:50:13] <wikibugs>	 (03PS9) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the failing linting step  - Copy hooks from readability model  - Add edit-check under /experimental/values-ml-staging-codfw.yaml  - Remove edit-check folder deploy it only under experimental  - Add MODEL_NAME at edit-check custom_env  - Create a swift bucket: s3://wmf-ml-
[15:50:13] <wikibugs>	 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[15:50:19] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:51:44] <wikibugs>	 (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Fix typo in edit-check folder.  - Add newest image version  - Try to fix the f (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[15:53:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb)
[15:55:22] <wikibugs>	 (03CR) 10Bking: global_config: add external services for opensearch clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122900 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol)
[15:55:26] <wikibugs>	 (03CR) 10Mvolz: citoid: migrate group1 wikis to use rest-gateway instead of restbase (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[15:56:22] <swfrench-wmf>	 jouncebot: nowandnext
[15:56:22] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1500)
[15:56:22] <jouncebot>	 In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800)
[15:56:30] <James_F>	 Clear here.
[15:56:58] <swfrench-wmf>	 thanks, James_F!
[15:58:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:58:20] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[15:58:23] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[15:58:32] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[15:58:47] <wikibugs>	 (03CR) 10Bartosz Dziewoński: Deduplicate JsonConfig config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński)
[15:59:17] <wikibugs>	 (03PS11) 10Elukey: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[16:01:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10583616 (10xcollazo) @cmooney, should we move forward with this patch sometime soon?
[16:01:15] <wikibugs>	 (03CR) 10Elukey: "Two unresolved comments and then you are good to go!" [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[16:01:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73703 and previous config saved to /var/cache/conftool/dbconfig/20250226-160156-root.json
[16:02:39] <wikibugs>	 (03PS12) 10Herron: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417)
[16:02:39] <wikibugs>	 (03PS2) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287)
[16:03:19] <swfrench-wmf>	 !log cumin 'A:cp-text' 'disable-puppet "merging ATS Lua config change - T383845"'
[16:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:24] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:03:44] <wikibugs>	 (03CR) 10Herron: aux-k8s-ctrl codfw: apply role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[16:03:55] <wikibugs>	 (03CR) 10Scott French: [C:03+2] "Thanks again! Moving ahead with this now." [puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[16:06:04] <brennen>	 jouncebot nowandnext
[16:06:04] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 53 minute(s)
[16:06:04] <jouncebot>	 In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800)
[16:10:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede)
[16:13:28] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[16:17:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73704 and previous config saved to /var/cache/conftool/dbconfig/20250226-161701-root.json
[16:17:21] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] "besides a few minor nits in the commit message, the rest LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:17:42] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage
[16:18:12] <wikibugs>	 (03CR) 10Thcipriani: [C:03+1] gerrit: remove explicit UseG1GC flag [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar)
[16:19:57] <Amir1>	 !log dropping incorrectly created tables in new wikis (T352113)
[16:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:00] <stashbot>	 T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[16:21:48] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage
[16:24:07] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@43155d4]: deploy phab2002 for T387172
[16:24:10] <stashbot>	 T387172: Deploy Phabricator/Phorge 2025-02-25 - https://phabricator.wikimedia.org/T387172
[16:24:35] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@43155d4]: deploy phab2002 for T387172 (duration: 00m 28s)
[16:24:57] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@43155d4]: deploy phab1004 for T387172
[16:25:46] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@43155d4]: deploy phab1004 for T387172 (duration: 00m 49s)
[16:26:33] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10583708 (10tappof) @wiki_willy The data is missing because Prometheus is not configured to retrieve metrics from magru's PDUs, as they are not present in NetBox. As soon as they are added...
[16:34:14] <wikibugs>	 (03CR) 10AikoChou: "The patch looks good to me, but the commit title is loooong lol" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:34:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583733 (10cmooney) My robot friend suggested this which works to adjust the result of the promql to the right units: ` gnmi_bgp_neighbor_last_established{instance="$devi...
[16:34:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3648 MB (3% inode=98%): /tmp 3648 MB (3% inode=98%): /var/tmp 3648 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[16:36:21] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290)
[16:37:14] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez)
[16:37:33] <swfrench-wmf>	 !log cumin -b8 -s90 'A:cp-text' 'run-puppet-agent -e "merging ATS Lua config change - T383845"'
[16:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:37] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[16:38:56] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10583782 (10VRiley-WMF) 05Open→03In progress Proceeding with action
[16:40:20] <wikibugs>	 (03PS3) 10Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[16:41:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: 10Arlolra)
[16:42:10] <icinga-wm>	 PROBLEM - Host ganeti1044 is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:06] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290)
[16:43:06] <wikibugs>	 (03PS1) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290)
[16:43:16] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2075.codfw.wmnet with OS bullseye
[16:43:22] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed: - ms-be2075 (**PASS**)...
[16:44:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[16:44:34] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti1044:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:44:38] <wikibugs>	 (03PS4) 10Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[16:44:48] <icinga-wm>	 PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100%
[16:44:59] <swfrench-wmf>	 jouncebot: nowandnext
[16:45:00] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 15 minute(s)
[16:45:00] <jouncebot>	 In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800)
[16:45:19] <wikibugs>	 (03PS1) 10Itamar Givon: Remove unused route file from Wikibase REST API configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774)
[16:45:47] <wikibugs>	 (03PS10) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Add newest image version  - Add edit-check under /experimental/values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[16:45:58] <swfrench-wmf>	 brennen: were you planning to deploy to mediawiki, or was the phab deploy the extent of it?
[16:46:11] <brennen>	 swfrench-wmf: just phab
[16:46:21] <swfrench-wmf>	 awesome, thanks!
[16:46:35] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290)
[16:46:35] <wikibugs>	 (03PS2) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290)
[16:46:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:47:11] <moritzm>	 ^ expected, ganeti1044 is being moved (but it's drained of VMs)
[16:47:46] <wikibugs>	 (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model.  - Add newest image version  - Add edit-check under /experimental/values-ml-stagi (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:48:33] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "Thnx for reviewing it folks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:49:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:50:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:51:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host maps2009
[16:51:01] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "I do not have the option of +2 here. I just did a +1, probably someone else needs to merge it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:51:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2009
[16:51:24] <icinga-wm>	 RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[16:53:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon)
[16:53:16] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10583883 (10Jhancock.wm)
[16:54:23] <moritzm>	 !log repooled maps2009 after completed server move T383709
[16:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:26] <stashbot>	 T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709
[16:54:28] <icinga-wm>	 RECOVERY - Host ganeti1044 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[16:56:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:57:22] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti1044:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:00:05] <jouncebot>	 swfrench-wmf: That opportune time for a MediaWiki infrastructure (one-off) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1700).
[17:01:14] <swfrench-wmf>	 o/
[17:02:00] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[17:02:02] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291)
[17:02:47] <wikibugs>	 (03PS1) 10Hnowlan: switchdc: remove metal jobrunner references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155)
[17:02:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[17:03:14] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[17:04:00] <wikibugs>	 (03Merged) 10jenkins-bot: Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[17:04:27] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]]
[17:04:32] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:04:32] <stashbot>	 T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[17:04:43] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122997
[17:05:44] <wikibugs>	 (03PS3) 10Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767)
[17:05:48] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "\o/" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[17:07:16] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999
[17:07:28] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:08:27] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[17:09:12] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291)
[17:09:35] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10583988 (10wiki_willy) Hi @tappof - thanks for looking into this.  It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024:  https://netbox.wikimedia.org/...
[17:09:45] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[17:10:35] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10583993 (10Jhancock.wm) @Scott_French would you be able to (or know who) could help me move conf2005 to clear u...
[17:11:35] <wikibugs>	 (03CR) 10Scott French: [C:03+1] switchdc: remove metal jobrunner references (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[17:12:50] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291)
[17:13:21] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[17:15:00] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]] (duration: 10m 33s)
[17:15:05] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:15:06] <stashbot>	 T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[17:16:07] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10584031 (10Scott_French) @Jhancock.wm - Thanks for flagging! Yes, I can help you with that. I'll open a task sp...
[17:16:37] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10584034 (10tappof) Thank you, @wiki_willy, for pointing me in the right direction within NetBox. It seems the PuppetQL query might need to be updated (different model and/or type?). I'll t...
[17:21:14] <wikibugs>	 (03PS6) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[17:23:13] <wikibugs>	 (03CR) 10Ollie Shotton: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon)
[17:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:26:09] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10584096 (10MatthewVernon) Looking harder at the timestamps around 13:12:28 just of the archive URL, going by the high-resolution timestamp order:...
[17:26:19] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,wmcs: Enable IPIP on labweb-ssl@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305)
[17:26:44] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[17:26:47] <stashbot>	 T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[17:28:17] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: 10Vgutierrez)
[17:29:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: 10Kimberly Sarabia)
[17:33:35] <wikibugs>	 (03CR) 10AikoChou: inference-services: deployment for edit-check dummy model.  - Add newest image version  - Add edit-check under /experimental/values-ml-stagi (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[17:33:41] <wikibugs>	 (03PS1) 10Ollie Shotton: Test new term store config in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592)
[17:36:19] <wikibugs>	 (03PS5) 10Simon04: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285)
[17:37:12] <wikibugs>	 (03CR) 10Simon04: "Done. Looking forward to your review." [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04)
[17:38:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04)
[17:41:30] <zabe>	 !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php sylwiki --fix # T387266
[17:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:35] <stashbot>	 T387266: MediaWiki\Page\PageAssertionException on https://syl.wikipedia.org/w/index.php?title=ꠃꠁꠇꠤꠙꠤꠒꠤꠀ:ꠀꠅꠇꠣ_ꠘꠄꠀꠁꠘ&diff=prev&oldid=9645 - https://phabricator.wikimedia.org/T387266
[17:44:06] <zabe>	 !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php sylwiki --fix --add-prefix T387266
[17:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:28] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[17:45:46] <wikibugs>	 (03PS3) 10Ssingh: wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378)
[17:46:42] <wikibugs>	 (03CR) 10AikoChou: "You should have +2 option as well (Tobias can fix this)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[17:47:49] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10584273 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF ganeti1044 has been relocated to U30 in the same rack with the same co...
[17:55:05] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[17:55:54] <wikibugs>	 (03CR) 10Andrea Denisse: "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[17:56:03] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez)
[18:00:03] <wikibugs>	 (03PS2) 10Hnowlan: switchdc: remove metal jobrunner references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155)
[18:00:05] <jouncebot>	 swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800).
[18:00:13] <swfrench-wmf>	 o/
[18:00:18] <wikibugs>	 (03CR) 10Hnowlan: switchdc: remove metal jobrunner references (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[18:00:30] <wikibugs>	 (03PS4) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214)
[18:00:30] <wikibugs>	 (03PS1) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010
[18:00:45] <swfrench-wmf>	 I plan to use the second half hour of this window, so if anyone needs the first half hour for anything, please go ahead
[18:01:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: java updates - bking@cumin2002 - T377938
[18:06:11] <wikibugs>	 (03PS11) 10Gkyziridis: inference-services: deployment for edit-check dummy model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100)
[18:07:26] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584344 (10VRiley-WMF)
[18:08:09] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: 10Arlolra)
[18:13:06] <wikibugs>	 (03PS1) 10Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088)
[18:32:58] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[18:34:13] <wikibugs>	 (03Merged) 10jenkins-bot: inference-services: deployment for edit-check dummy model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[18:34:14] <wikibugs>	 (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:35:12] <wikibugs>	 (03CR) 10Volans: Allow HTTPS connections from production to mgmt networks (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[18:36:10] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[18:38:10] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[18:39:44] <wikibugs>	 (03PS2) 10Scott French: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli)
[18:40:41] <wikibugs>	 (03PS6) 10Simon04: www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285)
[18:42:31] <swfrench-wmf>	 alright, I'm back and will be making my planned changes shortly
[18:48:55] <wikibugs>	 (03CR) 10Cathal Mooney: Allow HTTPS connections from production to mgmt networks (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[18:51:06] <wikibugs>	 (03PS2) 10Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088)
[18:52:17] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks for the feedback @volans.  I've changed this slightly, still adding a new term but not including the cumin_group as it's already al" [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[18:53:10] <wikibugs>	 (03CR) 10Cathal Mooney: Allow HTTPS connections from production to mgmt networks (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[18:56:49] <swfrench-wmf>	 alas, my change will need to wait for now. I'll follow up later on today in an idle window.
[19:00:06] <jouncebot>	 dduvall and andre: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1900).
[19:02:50] <wikibugs>	 (03PS2) 10Scott French: Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli)
[19:03:35] <wikibugs>	 (03CR) 10Scott French: "Rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: 10Effie Mouzeli)
[19:08:15] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369)
[19:08:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot)
[19:09:09] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369) (owner: 10TrainBranchBot)
[19:10:56] <wikibugs>	 06SRE, 06Traffic, 07Wikimedia-production-error: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584517 (10Aklapper)
[19:12:02] <wikibugs>	 06SRE, 06Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584521 (10Aklapper)
[19:18:34] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.18  refs T382369
[19:18:38] <stashbot>	 T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369
[19:21:04] <wikibugs>	 SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584529 (ssingh) @Grand-Duc: Hi, does this still persist for you? Or has it resolved?
[19:24:04] <wikibugs>	 (CR) Scott French: [C:+1] "Explicitly, this still LGTM, but it would be preferable to get a second pair of eyes on this following my edits." [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[19:24:18] <wikibugs>	 (CR) Scott French: [C:+1] "Explicitly, this still LGTM, but it would be preferable to get a second pair of eyes on this following my edits." [mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli)
[19:25:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: java updates - bking@cumin2002 - T377938
[19:25:59] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584533 (VRiley-WMF) Open→Resolved
[19:26:09] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584536 (VRiley-WMF) This is completed
[19:34:46] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[19:36:24] <wikibugs>	 (CR) Kamila Součková: [C:+1] Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli)
[19:38:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra)
[19:41:40] <wikibugs>	 (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[19:44:06] <swfrench-wmf>	 dduvall: if I were to make a no-code-changes deployment (to shift a bit of traffic from PHP 7.4 to 8.1) during the latter half of your train window, would that be disruptive? no worries at all if you'd prefer I don't
[19:46:14] <Reedy>	 swfrench-wmf: I'm guessing if the train has moved, and is stable, it almost certainly won't be an issue (ie the latter half won't be used anyway)
[19:46:35] <dduvall>	 swfrench-wmf: i'm looking into an error at the moment but the rate is very low, so feel free to go ahead
[19:46:38] <dduvall>	 also, thanks for asking
[19:48:27] <swfrench-wmf>	 dduvall: Reedy: great, thank you both. I'll move ahead with my change in a couple of minutes.
[19:51:42] <wikibugs>	 (PS1) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581)
[19:52:32] <wikibugs>	 (PS2) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581)
[19:59:15] <wikibugs>	 (PS3) Scott French: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[20:02:19] <wikibugs>	 (CR) Scott French: [C:+2] "One last issue I just noticed before merging, but should be good to go now. Thank you both!" [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[20:03:35] <wikibugs>	 (Merged) jenkins-bot: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli)
[20:06:27] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to resume capacity-based 8.1 migrations - T383845
[20:06:32] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[20:08:36] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to resume capacity-based 8.1 migrations - T383845 (duration: 03m 08s)
[20:25:51] <wikibugs>	 (CR) Bvibber: "Looks correct glancing over it, but I haven't tested the output arrays to confirm they're not missing anything yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[20:25:53] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[20:32:59] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2088.codfw.wmnet with OS bookworm
[20:33:06] <wikibugs>	 (PS1) BCornwall: cloud: update default acmechief_host host [puppet] - https://gerrit.wikimedia.org/r/1123028
[20:33:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:35:37] <wikibugs>	 (CR) Ssingh: [C:+1] cloud: update default acmechief_host host [puppet] - https://gerrit.wikimedia.org/r/1123028 (owner: BCornwall)
[20:35:42] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:43:36] <wikibugs>	 (PS1) Gergő Tisza: [WIP] Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695)
[20:44:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: java updates - bking@cumin2002 - T377938
[20:45:54] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:47:41] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be2088 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387392 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:47:53] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387392 (ops-monitoring-bot) NEW
[20:50:57] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2100).
[21:00:05] <jouncebot>	 kimberly_sarabia and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:16] <kimberly_sarabia>	 Hi! I'm here
[21:00:41] <arlolra>	 o/
[21:03:04] <cjming>	 o/
[21:03:07] <cjming>	 i can deploy :)
[21:03:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:03:52] <wikibugs>	 (PS4) Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767)
[21:04:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[21:05:36] <wikibugs>	 (Merged) jenkins-bot: Add config for donate banner to be enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[21:06:03] <kimberly_sarabia>	 cjming: tysm!
[21:06:26] <cjming>	 kimberly_sarabia: should be live :)
[21:06:36] <wikibugs>	 (PS5) Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[21:07:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra)
[21:07:57] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra)
[21:08:10] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[21:08:24] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]]
[21:08:29] <stashbot>	 T387254: Parsoid Read Views to Wiktionary deploy ~2025-02-27 - https://phabricator.wikimedia.org/T387254
[21:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:10:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1126:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1126 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:10:49] <cjming>	 arlolra: on test servers
[21:10:56] <arlolra>	 looking
[21:11:20] <kimberly_sarabia>	 cjming: ty!
[21:11:23] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.008e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[21:11:34] <logmsgbot>	 !log cjming@deploy2002 cjming, arlolra: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:11:34] <cjming>	 kimberly_sarabia: yw!
[21:12:00] <arlolra>	 cjming: lgtm
[21:12:08] <cjming>	 great - syncing
[21:12:11] <logmsgbot>	 !log cjming@deploy2002 cjming, arlolra: Continuing with sync
[21:15:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1126:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1126 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:18:35] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]] (duration: 10m 10s)
[21:18:39] <stashbot>	 T387254: Parsoid Read Views to Wiktionary deploy ~2025-02-27 - https://phabricator.wikimedia.org/T387254
[21:18:51] <arlolra>	 thanks cjming 
[21:19:04] <cjming>	 yw!
[21:19:34] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[21:23:47] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:38:14] <wikibugs>	 (PS1) Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007)
[21:39:02] <jhathaway>	 !log disabling auto reboot for debian imaging temporarily
[21:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:43] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[21:40:04] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[21:41:22] <tgr|away>	 cjming: you are not deploying anymore, right? I'd add one more patch
[21:44:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[21:45:29] <wikibugs>	 (Merged) jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[21:45:59] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]]
[21:46:03] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007
[21:49:03] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:51:55] <wikibugs>	 (PS1) Scott French: shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038)
[21:53:10] <wikibugs>	 (CR) Scott French: [C:+2] shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[21:54:23] <wikibugs>	 (Merged) jenkins-bot: shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[21:56:22] <cjming>	 tgr: sorry - yes!
[21:56:51] <tgr|away>	 thanks, I figured it out from the logs eventually :)
[21:57:21] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[21:58:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: java updates - bking@cumin2002 - T377938
[21:58:53] <swfrench-wmf>	 tgr|away: FYI, I'm going to be running a helmfile deployment for shellbox concurrent with your rollout. should not conflict in any way, but just wanted to flag it so it's not a surprise here.
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2200)
[22:00:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[22:01:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[22:03:57] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]] (duration: 17m 57s)
[22:04:01] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007
[22:04:56] <tgr|away>	 !log UTC late deploys done
[22:04:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[22:05:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[22:06:25] <swfrench-wmf>	 !log shellbox-media reverted to PHP 7.4 - T377038
[22:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:28] <stashbot>	 T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038
[22:07:26] <inflatador>	 !log bking@cumin2002:~$ sudo apt-get install -y python3-opensearch T383811
[22:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:30] <stashbot>	 T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811
[22:14:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm
[22:14:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm
[22:14:22] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10584999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm
[22:14:23] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm
[22:24:04] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[22:27:17] <wikibugs>	 (PS4) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:27:56] <wikibugs>	 (PS5) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:28:21] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:30:35] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release
[22:34:19] <wikibugs>	 (PS6) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:35:27] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:35:39] <logmsgbot>	 dzahn@cumin1002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[22:39:46] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[22:40:38] <wikibugs>	 (PS7) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:40:46] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:42:18] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release
[22:43:29] <wikibugs>	 (PS8) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:44:52] <logmsgbot>	 dzahn@cumin1002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade.
[22:45:22] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:50:51] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm
[22:53:27] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release
[22:54:56] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2300)
[23:08:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm
[23:08:13] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585249 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu...
[23:08:23] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:12:15] <wikibugs>	 (CR) Ladsgroup: "Thanks. I think we need this added for testcommonswiki too (unless we are planning to drop it once done)?" [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: Bvibber)
[23:15:14] <wikibugs>	 (CR) Ladsgroup: "Thanks I will try to get this deployed soon" [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: Simon04)
[23:18:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm
[23:18:54] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585282 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm
[23:21:20] <wikibugs>	 (PS1) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767)
[23:22:25] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.008e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:22:38] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm
[23:26:37] <wikibugs>	 (CR) Jdlrobson: [C:-1] Disable donate link in beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[23:28:26] <wikibugs>	 (PS2) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767)
[23:31:55] <wikibugs>	 (CR) Jdlrobson: [C:-1] Disable donate link in beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[23:36:16] <wikibugs>	 (PS3) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767)
[23:41:14] <wikibugs>	 (CR) Kimberly Sarabia: Disable donate link in beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia)
[23:42:10] <wikibugs>	 (PS1) Effie Mouzeli: WIP: introduce mw-experimental functionality [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994)
[23:42:33] <wikibugs>	 (CR) CI reject: [V:-1] WIP: introduce mw-experimental functionality [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli)
[23:55:19] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10585396 (Pppery) What have happened on the MediaWiki side is:  ` 13:12, 25 February 2025 Sreejithk2000 talk contribs moved page File:Blason fam...