[00:00:13] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [00:00:35] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1029.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [00:00:35] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:00:36] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1029.eqiad.wmnet [00:00:38] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [00:03:02] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:03:03] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1027.eqiad.wmnet [00:03:35] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [00:04:31] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.267s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:05:56] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:05:57] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1028.eqiad.wmnet [00:06:34] !log cwhite@cumin2002 START - Cookbook sre.hosts.decommission for hosts logstash1026.eqiad.wmnet [00:09:31] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.17s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:10:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: 
CRITICAL slave_sql_lag Replication lag: 639.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:12:55] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [00:13:19] (03PS1) 10Eevans: sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) [00:14:01] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [00:19:40] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [00:20:16] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.412s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1151:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1151 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:25:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.12s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:25:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.042s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:30:31] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.392s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:33:59] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: logstash1026.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cwhite@cumin2002" [00:33:59] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:34:00] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts logstash1026.eqiad.wmnet [00:34:24] (03PS1) 10Tim Starling: CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253) [00:35:31] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.492s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:35:53] (03CR) 10Tim Starling: [C:03+2] CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253) (owner: 10Tim Starling) [00:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1122698 [00:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1122698 (owner: 10TrainBranchBot) [00:43:16] (03Merged) 10jenkins-bot: CodeMirror: use the EditorView's state property on form submission [extensions/CodeMirror] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1122697 (https://phabricator.wikimedia.org/T387253) (owner: 10Tim Starling) [00:44:45] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]] [00:44:49] T387253: Codemirror broken and doesn't recognize changes - https://phabricator.wikimedia.org/T387253 [00:45:40] (03PS2) 10Eevans: sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) [00:46:25] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [00:47:48] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:51:13] !log tstarling@deploy2002 tstarling: Continuing with sync [00:51:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.019s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:51:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:51:26] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1122698 (owner: TrainBranchBot)
[00:55:32] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[00:56:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:57:47] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122697|CodeMirror: use the EditorView's state property on form submission (T387253)]] (duration: 13m 02s)
[00:57:51] T387253: Codemirror broken and doesn't recognize changes - https://phabricator.wikimedia.org/T387253
[01:00:23] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:00:31] SRE, Bitu, Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405#10581265 (nshahquinn-wmf) Open→Resolved IDM has been up and running for a long time now, so unless I'm missing something, this is done.
[01:03:56] (PS1) Bartosz Dziewoński: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748)
[01:04:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:05:06] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:05:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[01:05:19] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[01:05:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:29] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:32] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710
[01:08:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:09:07] !log removing 4 files for legal compliance
[01:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:09:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:09:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:17:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:17:11] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm
[01:17:12] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:17:19] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1...
[01:18:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm
[01:18:21] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm
[01:19:30] !log removing 1 file for legal compliance
[01:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:46] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10581299 (Neobeta61) According to the MR Functional Spec import foreign drive happens 'at boot'. But I used the restart command 'stor...
[01:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:30:02] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:50:28] (PS1) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711
[01:54:39] (CR) Bartosz Dziewoński: "Recently touched in:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[01:55:41] (CR) Reedy: "recheck" [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:55:48] (CR) Reedy: "resubmit" [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[01:56:31] (CR) Reedy: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner:
TrainBranchBot)
[02:00:03] (PS2) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711
[02:12:59] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:31] (PS1) Arlolra: Turn on Parsoid Read Views for XX wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[02:17:56] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1122710 (owner: TrainBranchBot)
[02:18:33] (PS2) Arlolra: Turn on Parsoid Read Views for XX wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254)
[02:32:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:47:35] !log jclark@cumin1002 END (FAIL) -
Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS bookworm [02:47:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10581369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [02:56:43] RECOVERY - snapshot of s8 in codfw on backupmon1001 is OK: Last snapshot for s8 at codfw (db2198) taken on 2025-02-26 01:44:59 (1176 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:59:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.163s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:04:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1165:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1165 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.146s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.059s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:19:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 826.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:24:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 939.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:32:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.053s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.079s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:54:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.08s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:59:15] RESOLVED: 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.237s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:59:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 977.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:04:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.221s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:07:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.152s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:17:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.329s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:40:15] FIRING: [2x] 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.324s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:50:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.097s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:53:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.371s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:03:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.081s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:04:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:04:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:09:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:12:15] PROBLEM - BGP status 
on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:12:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:17:41] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:22:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1195 db2153', diff saved to https://phabricator.wikimedia.org/P73602 and previous config saved to /var/cache/conftool/dbconfig/20250226-062234-root.json [06:22:54] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2153.codfw.wmnet [06:23:00] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1195.eqiad.wmnet [06:24:33] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122720 (https://phabricator.wikimedia.org/T387270) [06:24:37] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1122721 (https://phabricator.wikimedia.org/T387270) [06:25:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T387270 [06:25:28] T387270: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T387270 [06:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T387270', diff saved to 
https://phabricator.wikimedia.org/P73603 and previous config saved to /var/cache/conftool/dbconfig/20250226-062535-marostegui.json [06:26:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T387270', diff saved to https://phabricator.wikimedia.org/P73604 and previous config saved to /var/cache/conftool/dbconfig/20250226-062617-marostegui.json [06:26:44] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2220 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122720 (https://phabricator.wikimedia.org/T387270) (owner: 10Gerrit maintenance bot) [06:30:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2153.codfw.wmnet [06:30:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1195.eqiad.wmnet [06:30:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1195.eqiad.wmnet with reason: Index rebuild [06:30:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Index rebuild [06:38:05] !log Starting s7 codfw failover from db2218 to db2220 - T387270 [06:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:09] T387270: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T387270 [06:38:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 codfw as read-only for maintenance - T387270', diff saved to https://phabricator.wikimedia.org/P73605 and previous config saved to /var/cache/conftool/dbconfig/20250226-063817-root.json [06:38:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary and set section read-write T387270', diff saved to https://phabricator.wikimedia.org/P73606 and previous config saved to /var/cache/conftool/dbconfig/20250226-063844-root.json [06:39:30] (03CR) 10Marostegui: [C:03+2] wmnet: Update s7-master alias [dns] - 
10https://gerrit.wikimedia.org/r/1122721 (https://phabricator.wikimedia.org/T387270) (owner: 10Gerrit maintenance bot) [06:39:38] !log marostegui@dns1006 START - running authdns-update [06:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T387270', diff saved to https://phabricator.wikimedia.org/P73607 and previous config saved to /var/cache/conftool/dbconfig/20250226-064009-marostegui.json [06:41:38] !log marostegui@dns1006 END - running authdns-update [06:42:01] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2218.codfw.wmnet [06:44:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Index rebuild [06:45:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1173.eqiad.wmnet [06:45:19] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2229.codfw.wmnet [06:48:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2218.codfw.wmnet [06:48:45] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Index rebuild [06:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1037', diff saved to https://phabricator.wikimedia.org/P73608 and previous config saved to /var/cache/conftool/dbconfig/20250226-065054-marostegui.json [06:51:16] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1037.eqiad.wmnet [06:51:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2229.codfw.wmnet [06:51:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1173.eqiad.wmnet [06:52:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Index rebuild [06:52:43] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, 
AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:52:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Index rebuild [06:56:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:56:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73609 and previous config saved to /var/cache/conftool/dbconfig/20250226-065634-marostegui.json [06:56:39] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [06:57:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73610 and previous config saved to /var/cache/conftool/dbconfig/20250226-065742-marostegui.json [06:58:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1037.eqiad.wmnet [06:59:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1037.eqiad.wmnet with reason: Maintenance [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0700) [07:01:10] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:04:10] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P73611 and previous config saved to /var/cache/conftool/dbconfig/20250226-070841-root.json [07:10:40] (03PS1) 10Marostegui: installserver: Do not reimage db1253 [puppet] - 10https://gerrit.wikimedia.org/r/1122859 [07:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P73612 and previous config saved to /var/cache/conftool/dbconfig/20250226-071248-marostegui.json [07:13:06] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage db1253 [puppet] - 10https://gerrit.wikimedia.org/r/1122859 (owner: 10Marostegui) [07:23:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73613 and previous config saved to /var/cache/conftool/dbconfig/20250226-072347-root.json [07:24:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1035 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122860 (https://phabricator.wikimedia.org/T387271) [07:26:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1035 with weight 0 T387271', diff saved to https://phabricator.wikimedia.org/P73614 and previous config saved to /var/cache/conftool/dbconfig/20250226-072600-root.json [07:26:04] T387271: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T387271 [07:26:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387271 [07:26:56] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es1035 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1122860 (https://phabricator.wikimedia.org/T387271) (owner: 10Gerrit maintenance bot) [07:27:26] !log Starting es7 eqiad failover from es1039 to es1035 - T387271 [07:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote 
es1035 to es7 primary and set section read-write T387271', diff saved to https://phabricator.wikimedia.org/P73615 and previous config saved to /var/cache/conftool/dbconfig/20250226-072751-root.json [07:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P73616 and previous config saved to /var/cache/conftool/dbconfig/20250226-072802-marostegui.json [07:28:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1039 T387271', diff saved to https://phabricator.wikimedia.org/P73617 and previous config saved to /var/cache/conftool/dbconfig/20250226-072845-root.json [07:29:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es1035', diff saved to https://phabricator.wikimedia.org/P73618 and previous config saved to /var/cache/conftool/dbconfig/20250226-072908-marostegui.json [07:30:41] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1039.eqiad.wmnet [07:36:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1039.eqiad.wmnet [07:37:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [07:37:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10581646 (10ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs [07:38:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1039.eqiad.wmnet with reason: maintenance [07:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73619 and previous config saved to /var/cache/conftool/dbconfig/20250226-073852-root.json [07:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P73620 and previous config saved to /var/cache/conftool/dbconfig/20250226-073901-root.json [07:41:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [07:41:47] (03CR) 10Filippo Giunchedi: [C:03+1] workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [07:42:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [07:42:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1122894 (https://phabricator.wikimedia.org/T387273) [07:43:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73621 and previous config saved to /var/cache/conftool/dbconfig/20250226-074309-marostegui.json [07:43:13] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [07:43:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [07:43:28] (03PS1) 10Muehlenhoff: Switch ganeti1024 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1122895 [07:43:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance [07:43:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T385645)', diff saved to https://phabricator.wikimedia.org/P73622 and previous config saved to /var/cache/conftool/dbconfig/20250226-074347-marostegui.json [07:44:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10581670 (10ops-monitoring-bot) Draining ganeti1024.eqiad.wmnet of running VMs 
[07:44:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T385645)', diff saved to https://phabricator.wikimedia.org/P73623 and previous config saved to /var/cache/conftool/dbconfig/20250226-074455-marostegui.json [07:52:12] (03Abandoned) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [07:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73624 and previous config saved to /var/cache/conftool/dbconfig/20250226-075251-root.json [07:53:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1224 T385645', diff saved to https://phabricator.wikimedia.org/P73625 and previous config saved to /var/cache/conftool/dbconfig/20250226-075347-root.json [07:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73626 and previous config saved to /var/cache/conftool/dbconfig/20250226-075357-root.json [07:54:01] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [07:54:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73627 and previous config saved to /var/cache/conftool/dbconfig/20250226-075406-root.json [07:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73628 and previous config saved to /var/cache/conftool/dbconfig/20250226-075629-root.json [08:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0800). 
[08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:06:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73629 and previous config saved to /var/cache/conftool/dbconfig/20250226-080656-root.json [08:07:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73630 and previous config saved to /var/cache/conftool/dbconfig/20250226-080757-root.json [08:09:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73631 and previous config saved to /var/cache/conftool/dbconfig/20250226-080903-root.json [08:09:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73632 and previous config saved to /var/cache/conftool/dbconfig/20250226-080911-root.json [08:11:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73633 and previous config saved to /var/cache/conftool/dbconfig/20250226-081134-root.json [08:22:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73634 and previous config saved to 
/var/cache/conftool/dbconfig/20250226-082201-root.json [08:23:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73635 and previous config saved to /var/cache/conftool/dbconfig/20250226-082302-root.json [08:23:14] (03CR) 10Brouberol: [C:03+1] wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [08:24:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73636 and previous config saved to /var/cache/conftool/dbconfig/20250226-082417-root.json [08:26:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73637 and previous config saved to /var/cache/conftool/dbconfig/20250226-082640-root.json [08:27:45] (03CR) 10Vgutierrez: hiera: Reimage lvs7002 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [08:28:20] (03CR) 10Volans: Expose _gql_execute to wmf-netbox (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [08:30:48] (03CR) 10Vgutierrez: [C:03+2] hiera: Reimage lvs7002 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122623 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [08:31:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:32:55] (03CR) 10Jelto: [C:03+2] gerrit: give it more time to terminate [puppet] - 10https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [08:33:34] !log vgutierrez@cumin1002 
START - Cookbook sre.hosts.reimage for host lvs7002.magru.wmnet with OS bookworm [08:34:21] (03CR) 10Volans: Netbox: fetch GQL queries from files (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 (owner: 10Ayounsi) [08:36:59] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:37:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73638 and previous config saved to /var/cache/conftool/dbconfig/20250226-083706-root.json [08:37:44] (03CR) 10Jelto: [V:03+1 C:03+2] sre.gitlab.upgrade: add a prompt before backups on replica [cookbooks] - 10https://gerrit.wikimedia.org/r/1122520 (owner: 10Jelto) [08:38:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73639 and previous config saved to /var/cache/conftool/dbconfig/20250226-083807-root.json [08:39:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73640 and previous config saved to /var/cache/conftool/dbconfig/20250226-083922-root.json [08:39:35] BGP alert is me reimaging lvs7002 [08:41:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:41:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73641 and previous config saved to /var/cache/conftool/dbconfig/20250226-084145-root.json [08:44:15] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: add a prompt before backups 
on replica [cookbooks] - 10https://gerrit.wikimedia.org/r/1122520 (owner: 10Jelto) [08:47:00] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:47:04] (03CR) 10Elukey: "Yes definitely!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [08:47:50] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [08:48:57] (03PS1) 10Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - 10https://gerrit.wikimedia.org/r/1122898 [08:48:59] (03PS1) 10Hashar: gerrit: remove explicit UseG1GC flag [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) [08:49:33] (03PS2) 10Elukey: kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) [08:51:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4983/console" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [08:51:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:42] (03CR) 10Jelto: "looks mostly good, two comments in-line" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [08:52:12] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db2229 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73642 and previous config saved to /var/cache/conftool/dbconfig/20250226-085212-root.json [08:52:13] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4985/console" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [08:53:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73643 and previous config saved to /var/cache/conftool/dbconfig/20250226-085312-root.json [08:55:03] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage [08:55:39] (03CR) 10Jelto: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [08:56:20] (03PS2) 10Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - 10https://gerrit.wikimedia.org/r/1122898 [08:56:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73644 and previous config saved to /var/cache/conftool/dbconfig/20250226-085650-root.json [08:57:15] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4987/co" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [08:58:37] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7002.magru.wmnet with reason: host reimage [08:59:28] (03PS3) 10Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - 10https://gerrit.wikimedia.org/r/1122898 [09:00:04] dduvall 
and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T0900). [09:01:13] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [09:01:13] (03PS6) 10Volans: Fix CI reported issues [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [09:01:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [09:01:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Index rebuild [09:01:38] (03CR) 10Volans: "@Arzhel, as agreed on IRC I took over the patch and fixed all the reported issues." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [09:01:44] (03PS4) 10Slyngshede: C:idm::deployment cleanup expired signup objects [puppet] - 10https://gerrit.wikimedia.org/r/1122898 [09:02:30] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4991/co" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [09:03:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1180 db2169', diff saved to https://phabricator.wikimedia.org/P73645 and previous config saved to /var/cache/conftool/dbconfig/20250226-090323-root.json [09:03:35] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1180.eqiad.wmnet [09:04:38] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4992/co" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [09:06:57] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [09:07:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075... 
[09:07:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2229 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73646 and previous config saved to /var/cache/conftool/dbconfig/20250226-090717-root.json [09:07:26] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2169.codfw.wmnet [09:08:34] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [09:08:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581816 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [09:09:01] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:51] (03CR) 10Ayounsi: Expose _gql_execute to wmf-netbox (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:10:02] (03PS1) 10Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) [09:10:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1180.eqiad.wmnet [09:10:35] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4995/co" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [09:10:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Index rebuild [09:11:46] (03CR) 10Volans: Expose _gql_execute to wmf-netbox (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:13:08] (03PS1) 10Jgiannelos: pcs: Enable more wikis for 
native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 [09:13:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2169.codfw.wmnet [09:13:51] (03CR) 10Jgiannelos: "I added another round of rollouts with more wikis this time." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (owner: 10Jgiannelos) [09:14:03] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Index rebuild [09:16:01] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:04] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [09:16:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075... 
[09:18:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7002.magru.wmnet with OS bookworm [09:19:18] (03PS2) 10Jgiannelos: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) [09:20:01] (03PS3) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [09:20:04] (03PS2) 10Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) [09:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:23:55] (03PS4) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [09:24:26] (03CR) 10Ayounsi: "Commit squashed in previous one. Replying to comments here." [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 (owner: 10Ayounsi) [09:26:00] (03CR) 10Ayounsi: [C:03+1] "Thanks!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [09:26:23] (03CR) 10Volans: [C:03+2] Fix CI reported issues [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [09:28:20] (03CR) 10CI reject: [V:04-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [09:32:43] (03Merged) 10jenkins-bot: Fix CI reported issues [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [09:35:03] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [09:35:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [09:39:13] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [09:39:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be2075... [09:41:42] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10581888 (10elukey) I tried multiple times to run reimage but the host doesn't PXE boot, not sure why, I tried to follow the console com2 as well but no clear error highlighted. 
[09:44:46] (03PS2) 10Muehlenhoff: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [09:47:47] (03PS1) 10Vgutierrez: hiera: restore lvs7002 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) [09:52:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:52:44] !log Restarting Gerrit on gerrit2002 and gerrit1003 [09:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:40] Feb 26 08:37:17 gerrit1003 systemd[1]: /lib/systemd/system/gerrit.service:16: Unknown key name 'TimeOutStopSec' in section 'Service', ignoring. [09:53:41] pff :/ [09:54:22] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10581920 (10MoritzMuehlenhoff) >>! In T383723#10579629, @Andrew wrote: > @MoritzMuehlenhoff ping, is ganeti1044 ready to be moved? Not yet, I'l... 
[09:54:27] anyway it has restarted [09:56:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [09:57:01] (03PS1) 10Hashar: gerrit: fix systemd service TimeoutStopSec [puppet] - 10https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754) [09:58:13] (03CR) 10Hashar: gerrit: give it more time to terminate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [09:58:31] hashar: I'm already preparing a fix to change the timeout setting [09:58:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [09:59:09] (03CR) 10Jelto: [C:03+2] "ah, you are faster, this looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1122906 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [09:59:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73647 and previous config saved to /var/cache/conftool/dbconfig/20250226-095917-root.json [10:00:16] (03CR) 10Vgutierrez: [C:03+2] hiera: restore lvs7002 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1122905 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73648 and previous config saved to /var/cache/conftool/dbconfig/20250226-100148-root.json [10:02:26] (03CR) 10Vgutierrez: hiera: Reimage lvs7001 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:02:34] (03PS2) 10Vgutierrez: hiera: Reimage lvs7001 as liberica LB [puppet] - 
10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) [10:05:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:08:55] !log depooling lvs7001 before reimaging - T384477 [10:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:59] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [10:09:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [10:09:30] (03CR) 10Vgutierrez: [C:03+2] hiera: Reimage lvs7001 as liberica LB [puppet] - 10https://gerrit.wikimedia.org/r/1122624 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [10:11:15] (03CR) 10Hashar: php: use component/pcre2 when using Php 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [10:11:26] (03PS3) 10Hashar: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) [10:11:35] PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:11:43] thanks moritzm , I was just contacting you for suggestion on that! 
:D [10:11:59] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:11] ^^ that's lvs7001 depooled [10:12:15] PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:12:58] :-) [10:13:19] PROBLEM - PyBal connections to etcd on lvs7001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:13:47] (03CR) 10Fabfur: [C:03+2] workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [10:13:51] (03CR) 10Fabfur: [V:03+2 C:03+2] workaround for T256098 [debs/benthos] - 10https://gerrit.wikimedia.org/r/1122557 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [10:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73649 and previous config saved to /var/cache/conftool/dbconfig/20250226-101401-root.json [10:14:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73650 and previous config saved to /var/cache/conftool/dbconfig/20250226-101422-root.json [10:15:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:16:54] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73651 and previous config saved to /var/cache/conftool/dbconfig/20250226-101654-root.json [10:17:49] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10581998 (10cmooney) >>! In T384731#10579181, @ayounsi wrote: >>> And what happens if peer_descr is missing or empt... [10:18:08] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs7001.magru.wmnet with OS bookworm [10:24:32] (03CR) 10Hnowlan: [C:03+1] pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [10:28:27] (03CR) 10AikoChou: [C:03+1] kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:29:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73652 and previous config saved to /var/cache/conftool/dbconfig/20250226-102906-root.json [10:29:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73653 and previous config saved to /var/cache/conftool/dbconfig/20250226-102927-root.json [10:32:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73654 and previous config saved to /var/cache/conftool/dbconfig/20250226-103159-root.json [10:36:12] (03CR) 10Clément Goubert: mwscript: do not run mesh checks when running 
in a loop (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1122606 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [10:36:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Index rebuild [10:39:10] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage [10:39:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Index rebuild [10:42:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs7001.magru.wmnet with reason: host reimage [10:42:35] (03PS1) 10Fabfur: benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - 10https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332) [10:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:44:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73655 and previous config saved to /var/cache/conftool/dbconfig/20250226-104359-root.json [10:44:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73656 and previous config saved to /var/cache/conftool/dbconfig/20250226-104411-root.json [10:44:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: Repooling after rebuild index', 
diff saved to https://phabricator.wikimedia.org/P73658 and previous config saved to /var/cache/conftool/dbconfig/20250226-104433-root.json [10:46:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73659 and previous config saved to /var/cache/conftool/dbconfig/20250226-104619-root.json [10:46:25] (03CR) 10Vgutierrez: [C:03+1] benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - 10https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73660 and previous config saved to /var/cache/conftool/dbconfig/20250226-104704-root.json [10:48:26] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [10:49:01] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-product: disable and remove the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [10:50:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:51:49] (03CR) 10Fabfur: [C:03+2] benthos: use hasty mode to avoid eventgate blocking http requests [puppet] - 10https://gerrit.wikimedia.org/r/1122917 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:52:05] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:25] (03PS1) 10Gkyziridis: inference-services: deployment for edit-check dummy model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [10:52:41] (03PS1) 10Vgutierrez: hiera: Restore lvs7001 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477) [10:53:35] (03CR) 10CI reject: [V:04-1] inference-services: deployment for edit-check dummy model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [10:54:15] (03CR) 10Volans: "The other CR might be abandoned at this point I guess" [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [10:55:50] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10582102 (10JMeybohm) I stumbled upon this again recently and I think the current configuration does not allow pod creation at... 
[10:59:05] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73661 and previous config saved to /var/cache/conftool/dbconfig/20250226-105905-root.json [10:59:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73662 and previous config saved to /var/cache/conftool/dbconfig/20250226-105916-root.json [10:59:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73663 and previous config saved to /var/cache/conftool/dbconfig/20250226-105937-root.json [10:59:53] !log Drop schema change on s3 codfw master with replication dbmaint T385645 [10:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:56] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1100) [11:01:05] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:01:17] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582145 (10ayounsi) I forked the discussion to {T387287} and {T387288} as that task was becoming more difficult to... 
[11:01:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73664 and previous config saved to /var/cache/conftool/dbconfig/20250226-110124-root.json [11:01:37] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10582148 (10ayounsi) [11:02:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73665 and previous config saved to /var/cache/conftool/dbconfig/20250226-110209-root.json [11:03:37] !log Drop schema change on s7 codfw master with replication dbmaint T385645 [11:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:14] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs7001.magru.wmnet with OS bookworm [11:04:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Schema change [11:14:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:14:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73666 and previous config saved to /var/cache/conftool/dbconfig/20250226-111410-root.json [11:14:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73667 and previous config saved to /var/cache/conftool/dbconfig/20250226-111421-root.json [11:15:41] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs7001 BGP priority [puppet] - 
10https://gerrit.wikimedia.org/r/1122919 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:16:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73668 and previous config saved to /var/cache/conftool/dbconfig/20250226-111629-root.json [11:20:12] FIRING: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:25] (03CR) 10Slyngshede: [C:04-1] "Change of plans, sorry. This will be rolled into https://phabricator.wikimedia.org/T341581." [puppet] - 10https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: 10Slyngshede) [11:20:28] !incidents [11:20:28] 5699 (UNACKED) ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad) [11:20:31] !ack 5699 [11:20:31] 5699 (ACKED) ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad) [11:20:50] I'm guessing that's unexpected? 
:) [11:21:07] nothing obvious in the above lines here...so maaaybe :) [11:21:46] !log repooling lvs7001 running liberica - T384477 [11:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [11:22:07] but it's reachable via ipv4 at least [11:25:12] RESOLVED: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:25:30] and back to normal [11:26:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:26:43] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:29:10] (03PS1) 10Effie Mouzeli: shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) [11:29:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73669 and previous config saved to /var/cache/conftool/dbconfig/20250226-112915-root.json [11:31:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73670 and previous config saved to /var/cache/conftool/dbconfig/20250226-113134-root.json [11:32:22] !log hnowlan@deploy1003 helmfile [staging] START 
helmfile.d/services/mobileapps: sync [11:32:27] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [11:33:43] (03CR) 10Gkyziridis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:33:45] (03CR) 10Gkyziridis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:34:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1168 db2180 T386242', diff saved to https://phabricator.wikimedia.org/P73671 and previous config saved to /var/cache/conftool/dbconfig/20250226-113453-root.json [11:34:58] T386242: Upgrade and rebuild s6 - https://phabricator.wikimedia.org/T386242 [11:36:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db2169 with 100%', diff saved to https://phabricator.wikimedia.org/P73673 and previous config saved to /var/cache/conftool/dbconfig/20250226-113613-marostegui.json [11:37:05] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2180.codfw.wmnet [11:37:14] (03CR) 10Clément Goubert: [C:03+1] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:37:16] (03CR) 10Hnowlan: [C:03+1] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:37:24] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1168.eqiad.wmnet [11:37:59] (03PS2) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [11:39:03] (03CR) 10CI reject: [V:04-1] inference-services: deployment for edit-check dummy model. 
- Fix typo in edit-check folder. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1206 db2170 T385561', diff saved to https://phabricator.wikimedia.org/P73674 and previous config saved to /var/cache/conftool/dbconfig/20250226-113935-root.json [11:39:40] T385561: Upgrade and rebuild s1 - https://phabricator.wikimedia.org/T385561 [11:39:44] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2170.codfw.wmnet [11:39:50] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1206.eqiad.wmnet [11:41:43] (03CR) 10Effie Mouzeli: [C:03+2] shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:42:23] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10582448 (10Vgutierrez) p:05Triage→03Medium [11:42:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [11:42:53] (03Merged) 10jenkins-bot: shellbox: all replicas on PHP 8.1 (media & timeline) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122923 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:43:28] (03PS3) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [11:43:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2180.codfw.wmnet [11:44:23] (03CR) 10CI reject: [V:04-1] inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. 
- Add newest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:45:01] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [11:45:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1168.eqiad.wmnet [11:45:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Index rebuild [11:45:38] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [11:45:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Index rebuild [11:45:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2170.codfw.wmnet [11:46:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Index rebuild [11:46:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73675 and previous config saved to /var/cache/conftool/dbconfig/20250226-114640-root.json [11:48:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1206.eqiad.wmnet [11:49:32] !log uploaded gobgpd 3.33 to apt.wm.o (bookworm-wikimedia) - T386687 [11:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:36] T386687: backport gobgp 3.33 from trixie - https://phabricator.wikimedia.org/T386687 [11:49:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Index rebuild [11:49:46] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:50:47] (03PS4) 10Gkyziridis: inference-services: 
deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [11:51:43] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:52:20] (03CR) 10CI reject: [V:04-1] inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [11:52:43] (03PS1) 10Vgutierrez: cumin: Remove lvs-magru alias [puppet] - 10https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477) [11:53:27] (03CR) 10Klausman: [C:03+1] "So, to summarize: this change is basically a no-op in prod, and when the patched version of kserve goes to prod, we get what the original " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:53:28] (03CR) 10Volans: [C:03+1] "LGTM, do we need equivalent aliases for liberica?" [puppet] - 10https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:54:13] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1024.eqiad.wmnet with reason: remove from cluster for reimage [11:54:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10582490 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dce06e0b-27de-4e76-8cf6-d4947764ef79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[11:54:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [11:55:10] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10582493 (10ops-monitoring-bot) Draining ganeti1044.eqiad.wmnet of running VMs [11:55:26] (03CR) 10Vgutierrez: [C:03+2] cumin: Remove lvs-magru alias [puppet] - 10https://gerrit.wikimedia.org/r/1122926 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:56:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [11:59:54] (03PS5) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step - Copy hooks from readability model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [11:59:56] (03CR) 10Elukey: "This change removes the extra security context from isvcs, so it should regenerate them etc.. but practically yes, these extra bits are ha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1200). 
[12:01:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Index rebuild [12:02:31] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [12:02:55] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [12:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:04:08] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122668 (owner: 10PipelineBot) [12:04:08] (03CR) 10Klausman: [C:03+1] "Ack, ty!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:05:18] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122668 (owner: 10PipelineBot) [12:06:23] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [12:06:38] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [12:06:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1037 with weight 0 T387273', diff saved to https://phabricator.wikimedia.org/P73676 and previous config saved to /var/cache/conftool/dbconfig/20250226-120649-root.json [12:06:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T387273 [12:06:53] T387273: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T387273 [12:07:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [12:07:13] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1122894 (https://phabricator.wikimedia.org/T387273) (owner: 
10Gerrit maintenance bot) [12:07:19] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10582517 (10ops-monitoring-bot) Draining ganeti1044.eqiad.wmnet of running VMs [12:07:44] !log Starting es6 eqiad failover from es1038 to es1037 - T387273 [12:07:46] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1024.eqiad.wmnet [12:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1037 to es6 primary and set section read-write T387273', diff saved to https://phabricator.wikimedia.org/P73677 and previous config saved to /var/cache/conftool/dbconfig/20250226-120806-root.json [12:08:21] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:08:44] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:08:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1038 T387273', diff saved to https://phabricator.wikimedia.org/P73678 and previous config saved to /var/cache/conftool/dbconfig/20250226-120848-root.json [12:09:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add weight to es1037', diff saved to https://phabricator.wikimedia.org/P73679 and previous config saved to /var/cache/conftool/dbconfig/20250226-120925-root.json [12:10:03] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:10:25] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1038.eqiad.wmnet [12:10:32] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:10:44] (03CR) 10Brouberol: [C:03+2] airflow-analytics-product: migrate the scheduler and the DB to Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122591 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [12:13:48] !log 
mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:14:17] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:19:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1038.eqiad.wmnet [12:20:45] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122927 [12:23:18] (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the f (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [12:24:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1038.eqiad.wmnet with reason: Index rebuild [12:27:31] (03PS1) 10Muehlenhoff: Blacklist hfs/hfsplus [puppet] - 10https://gerrit.wikimedia.org/r/1122929 [12:29:07] (03Abandoned) 10Ayounsi: Netbox: fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1122133 (owner: 10Ayounsi) [12:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:33:53] (03CR) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files (034 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [12:34:05] (03PS5) 10Ayounsi: Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 [12:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3643 MB (3% inode=98%): /tmp 3643 MB (3% inode=98%): /var/tmp 
3643 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [12:36:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [12:36:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10582628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm [12:38:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73680 and previous config saved to /var/cache/conftool/dbconfig/20250226-123800-root.json [12:39:00] (03CR) 10Kevin Bazira: "thank you for working on this, George. usually, with model-servers that are still in the experimental phase (like this edit-check), we dep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [12:40:11] (03CR) 10CI reject: [V:04-1] Expose _gql_execute to wmf-netbox + fetch GQL queries from files [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [12:44:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73681 and previous config saved to /var/cache/conftool/dbconfig/20250226-124433-root.json [12:46:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73682 and previous config saved to /var/cache/conftool/dbconfig/20250226-124602-root.json [12:46:21] (03PS1) 10Hnowlan: mobileapps: scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 
(https://phabricator.wikimedia.org/T372749) [12:53:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73683 and previous config saved to /var/cache/conftool/dbconfig/20250226-125305-root.json [12:57:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [12:59:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73684 and previous config saved to /var/cache/conftool/dbconfig/20250226-125938-root.json [13:01:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73685 and previous config saved to /var/cache/conftool/dbconfig/20250226-130107-root.json [13:03:29] (03PS6) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step - Copy hooks from readability model - Add edit-check under /experimental/values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [13:03:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:07:27] (03CR) 10Gkyziridis: "Thnx for reviewing this patch Kevin. 
I am not sure if I understood completely your comment, should I remove the folder '/helm.d/ml-service" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [13:07:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [13:07:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1024.eqiad.wmnet [13:08:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73686 and previous config saved to /var/cache/conftool/dbconfig/20250226-130810-root.json [13:08:42] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73687 and previous config saved to /var/cache/conftool/dbconfig/20250226-131443-root.json [13:16:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73688 and previous config saved to /var/cache/conftool/dbconfig/20250226-131612-root.json [13:19:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:19:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1014.eqiad.wmnet with OS 
bookworm [13:19:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10582738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host backup1014.eqiad.wmnet with OS bookworm executed with errors: - backup1... [13:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73689 and previous config saved to /var/cache/conftool/dbconfig/20250226-132315-root.json [13:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:24:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:24:14] !log testing gobgp 3.33 in lvs1013 - T386687 [13:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:18] T386687: backport gobgp 3.33 from trixie - https://phabricator.wikimedia.org/T386687 [13:25:16] (03CR) 10Kevin Bazira: "yes, for now the `helmfile.d/ml-services/edit-check/*` config doesn't have to be deployed since you'll likely end up using the `revision-m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [13:26:44] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1024 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1122895 (owner: 10Muehlenhoff) [13:29:31] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 14884MiB (0% 
inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [13:29:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73690 and previous config saved to /var/cache/conftool/dbconfig/20250226-132948-root.json [13:31:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73691 and previous config saved to /var/cache/conftool/dbconfig/20250226-133118-root.json [13:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3554 MB (3% inode=98%): /tmp 3554 MB (3% inode=98%): /var/tmp 3554 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:36:03] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:36:32] (03CR) 10Muehlenhoff: "Few things inline, otherwise LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede) [13:36:50] (03CR) 10Ladsgroup: [C:04-1] "You probably can just adjust this patch to allow them only. Does that sound good?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [13:36:52] (03PS1) 10Brouberol: airflow-analytics-product: add missing database value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623) [13:38:04] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 29263 [13:38:08] (03CR) 10Jennifer Ebe: [C:03+1] airflow-analytics-product: add missing database value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:38:15] (03CR) 10Brouberol: [C:03+2] airflow-analytics-product: add missing database value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122944 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:38:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73692 and previous config saved to /var/cache/conftool/dbconfig/20250226-133820-root.json [13:38:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29263 [13:38:52] (03PS1) 10Marostegui: valid_sections.pp: Add ms1, ms2, and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332) [13:39:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:39:39] (03CR) 10Ladsgroup: [C:03+1] "<3 <3 <3" [puppet] - 10https://gerrit.wikimedia.org/r/1122945 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [13:40:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:40:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:41:07] (03PS1) 10Ladsgroup: Set commons 
categorylinks migration to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164) [13:41:07] (03CR) 10Alexandros Kosiaris: [C:03+1] Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [13:44:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73693 and previous config saved to /var/cache/conftool/dbconfig/20250226-134453-root.json [13:45:04] (03CR) 10Jgiannelos: [C:03+1] mobileapps: scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan) [13:45:27] (03PS1) 10Brouberol: airflow-analytics-product: remove import mode overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122948 [13:45:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1122898 (owner: 10Slyngshede) [13:46:13] (03CR) 10Jennifer Ebe: [C:03+1] airflow-analytics-product: remove import mode overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122948 (owner: 10Brouberol) [13:46:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73694 and previous config saved to /var/cache/conftool/dbconfig/20250226-134623-root.json [13:47:22] (03CR) 10Brouberol: [C:03+2] airflow-analytics-product: remove import mode overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122948 (owner: 10Brouberol) [13:49:31] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [13:52:17] !log brouberol@deploy2002 
helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:52:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [13:54:54] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow-analytics-product: disable and remove the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1122592 (https://phabricator.wikimedia.org/T380623) (owner: 10Brouberol) [13:57:37] (03PS1) 10Slyngshede: Add option to delete a single signup [software/bitu] - 10https://gerrit.wikimedia.org/r/1122951 [13:57:45] !log dropped vote_log and arbcom1_vote tables on English Wikipedia (T376627) [13:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:48] T376627: Drop ad-hoc or obsolete tables in production - https://phabricator.wikimedia.org/T376627 [13:59:08] !log installing tiff security updates [13:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:28] ooh nice [14:01:00] Lucas_WMDE: would you merge core patches on master as part of the deployment window? 
[14:01:09] (03CR) 10Ladsgroup: [C:03+2] Set commons categorylinks migration to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [14:01:26] *confused* [14:01:32] anyway, I can’t deploy today, sorry [14:01:35] 😈 [14:01:49] Deploy I can take care of myself :D [14:02:02] (03Merged) 10jenkins-bot: Set commons categorylinks migration to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122946 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [14:03:08] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]] [14:03:12] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [14:03:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:05:10] (03CR) 10Elukey: [C:03+1] Blacklist hfs/hfsplus [puppet] - 10https://gerrit.wikimedia.org/r/1122929 (owner: 10Muehlenhoff) [14:06:22] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:52] (03PS4) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [14:07:28] (03PS5) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [14:08:10] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [14:09:09] (03PS6) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) [14:10:05] (03CR) 10Slyngshede: Ensure that the LDAP user is parsed as an Entry object. (035 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede) [14:10:47] (03CR) 10Elukey: aux-k8s-ctrl codfw: apply role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [14:11:37] (03CR) 10Giuseppe Lavagetto: mediawiki: introduce feature flags (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [14:12:27] (03CR) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [14:13:01] (03CR) 10Muehlenhoff: "Three comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:14:13] (03PS3) 10Muehlenhoff: php: use component/pcre2 when using Php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [14:14:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [14:14:39] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122946|Set commons categorylinks migration to WRITE_BOTH (T385164)]] (duration: 11m 31s) [14:14:40] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Observability-Logging, 13Patch-For-Review: decommission logstash102[6-9] - 
https://phabricator.wikimedia.org/T383287#10583004 (10colewhite) [14:14:43] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [14:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3636 MB (3% inode=98%): /tmp 3636 MB (3% inode=98%): /var/tmp 3636 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:14:55] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Observability-Logging, 13Patch-For-Review: decommission logstash102[6-9] - https://phabricator.wikimedia.org/T383287#10583009 (10colewhite) a:05colewhite→03None [14:15:44] (03CR) 10Elukey: "For the etcd k8s cluster it is unclear to me if we need a backup, we can probably raise the question to the kubernetes SIG and decide a st" [puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: 10Herron) [14:16:20] (03CR) 10Elukey: [C:03+2] kserve-inference: remove the need for the kserve container's securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122636 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [14:18:43] (03CR) 10Cwhite: [C:03+2] site: clean up logstash102[6789] configs [puppet] - 10https://gerrit.wikimedia.org/r/1122691 (https://phabricator.wikimedia.org/T383287) (owner: 10Cwhite) [14:19:01] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:20:32] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:21:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1014.eqiad.wmnet with OS bookworm [14:21:12] PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:22:08] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:22:10] (03CR) 10Jforrester: Deduplicate JsonConfig config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński) [14:22:35] (03CR) 10Jforrester: [C:03+1] Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683 (owner: 10Bartosz Dziewoński) [14:23:12] RECOVERY - BFD status on cr1-magru is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:24:03] (03PS1) 10Ayounsi: Add exporter port to gNMI metrics instance label [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) [14:24:52] (03CR) 10Jforrester: [C:03+1] Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [14:28:49] (03PS7) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. 
- Add newest image version - Try to fix the failing linting step - Copy hooks from readability model - Add edit-check under /experimental/values-ml-staging-codfw.yaml - Remove edit-check folder deploy it only under experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabric [14:30:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:30:13] (03PS1) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) [14:30:58] (03PS1) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) [14:30:59] (03PS1) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) [14:31:14] (03CR) 10Gkyziridis: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:31:44] (03PS2) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) [14:31:44] (03PS2) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) [14:32:03] (03CR) 10Bking: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [14:32:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [14:33:30] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:34:41] (03CR) 10CI reject: [V:04-1] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [14:35:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [14:35:34] (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the f (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:36:05] (03PS3) 10Bking: wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [14:36:05] (03CR) 10Ayounsi: "Yep, making sure I get the +1 from Cathal before deploying. Then feel free to deploy when I'm away." [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:36:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1014.eqiad.wmnet with reason: host reimage [14:36:25] (03CR) 10Bking: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [14:36:31] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:29] (03PS3) 10Vgutierrez: liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) [14:39:30] (03PS3) 10Vgutierrez: prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) [14:39:35] (03CR) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:39:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1014.eqiad.wmnet with reason: host reimage [14:40:52] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: java updates - bking@cumin2002 - T377938 [14:41:33] (03PS13) 10Brouberol: global_config: add external services for opensearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/1122900 (https://phabricator.wikimedia.org/T380752) [14:42:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [14:44:05] (03PS2) 10Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) [14:45:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Repooling after 
rebuild index', diff saved to https://phabricator.wikimedia.org/P73695 and previous config saved to /var/cache/conftool/dbconfig/20250226-144555-root.json [14:48:18] (03PS2) 10Hnowlan: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) [14:48:53] (03PS1) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) [14:49:09] (03PS3) 10Hnowlan: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) [14:49:15] Amir1: are you still deploying? I have a last-minute addition to the window [14:51:23] (not logged in on the deploy host so I guess not) [14:52:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [14:53:32] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972) [14:53:33] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977) [14:53:53] tgr|away: yeah, I went for lunch [14:55:14] (03Merged) 10jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [14:55:40] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1120968|CentralAuth: Enable SUL3 signup on group 0 (T384007)]] [14:55:44] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - 
https://phabricator.wikimedia.org/T384007 [14:57:05] (03CR) 10Jgiannelos: [C:03+1] mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan) [14:57:34] (03CR) 10Hnowlan: [C:03+2] mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan) [14:58:25] (03CR) 10Kevin Bazira: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the f (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [14:58:36] !log tgr@deploy2002 tgr: Backport for [[gerrit:1120968|CentralAuth: Enable SUL3 signup on group 0 (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:58:45] (03Merged) 10jenkins-bot: mobileapps: add networkpolicy for prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122932 (https://phabricator.wikimedia.org/T372749) (owner: 10Hnowlan) [14:59:20] (03CR) 10Eevans: [C:03+2] sessionstore: upgrade to 'dev' (Cassandra 4.1.8) [puppet] - 10https://gerrit.wikimedia.org/r/1122695 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1500) [15:00:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583340 (10Jhancock.wm) @MatthewVernon how's the two OS drives looking now? 
[15:01:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73696 and previous config saved to /var/cache/conftool/dbconfig/20250226-150100-root.json [15:01:22] tgr|away: Deploy complete? OK if I start the Wikifunctions service deploy? [15:01:28] (03PS1) 10Sergio Gimeno: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979) [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:43] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:02:53] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:03:10] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004.codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [15:03:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:04:02] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:04:26] (03CR) 10Fabfur: [C:03+1] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [15:04:27] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:04:45] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1014.eqiad.wmnet with OS bookworm [15:05:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [15:05:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583372 (10Jclark-ctr) END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jc... [15:05:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583373 (10Jclark-ctr) [15:06:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10583375 (10Jclark-ctr) 05Open→03Resolved a:05jcrespo→03Jclark-ctr [15:06:22] (03CR) 10Scott French: [C:03+1] "Thanks, Antoine!" [puppet] - 10https://gerrit.wikimedia.org/r/1122901 (https://phabricator.wikimedia.org/T387276) (owner: 10Hashar) [15:06:42] (03CR) 10Sergio Gimeno: [C:03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979) (owner: 10Sergio Gimeno) [15:06:57] James_F: just a sec, I'll roll back [15:07:02] !log tgr@deploy2002 Sync cancelled. [15:07:02] No worries. 
[15:07:14] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [15:07:39] (03PS1) 10TrainBranchBot: Revert "CentralAuth: Enable SUL3 signup on group 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969 [15:07:39] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I9e8451d22cd2d975e55ddba83ed06a7e98c15398" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [15:08:07] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122967 (https://phabricator.wikimedia.org/T386979) (owner: 10Sergio Gimeno) [15:08:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969 (owner: 10TrainBranchBot) [15:09:02] (03Merged) 10jenkins-bot: Revert "CentralAuth: Enable SUL3 signup on group 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122969 (owner: 10TrainBranchBot) [15:09:08] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: java updates - bking@cumin2002 - T377938 [15:09:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004.codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [15:09:33] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]] [15:09:58] 10SRE-swift-storage, 06Commons: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10583394 (10A_smart_kitten) Adding to the #sre-swift-storage queue for triage [15:10:32] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart 
for nodes matching sessionstore1004.eqiad.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [15:12:33] !log tgr@deploy2002 tgr, trainbranchbot: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:12:43] !log tgr@deploy2002 tgr, trainbranchbot: Continuing with sync [15:13:28] !log sgimeno@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [15:15:06] (03CR) 10Vgutierrez: [C:03+2] liberica: Enable gobgpd pprof/metrics endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1122958 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [15:15:44] !log sgimeno@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [15:16:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73697 and previous config saved to /var/cache/conftool/dbconfig/20250226-151606-root.json [15:16:29] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1004.eqiad.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [15:16:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73698 and previous config saved to /var/cache/conftool/dbconfig/20250226-151641-root.json [15:19:01] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122969|Revert "CentralAuth: Enable SUL3 signup on group 0"]] (duration: 09m 28s) [15:19:31] (03PS1) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) [15:19:43] (03CR) 10CI reject: [V:04-1] citoid: migrate group1 wikis to use rest-gateway instead of 
restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [15:20:48] !log sgimeno@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [15:20:57] (03PS1) 10Elukey: Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 [15:21:03] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972) (owner: 10Jforrester) [15:21:08] (03PS2) 10Elukey: Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 [15:21:41] !log UTC afternoon deploys done [15:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:45] sorry for the delay [15:22:08] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:22:11] No worries, it happens. Much better to test and revert than leave it broken to meet the time window deadline. 
[15:22:17] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:22:21] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-02-20-142923 to 2025-02-24-145135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122962 (https://phabricator.wikimedia.org/T386972) (owner: 10Jforrester) [15:22:42] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [15:22:50] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [15:23:54] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:24:32] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:24:46] !log sgimeno@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [15:25:04] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:25:51] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:25:53] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:26:23] (03PS2) 10Hnowlan: citoid: migrate group1 wikis to use rest-gateway instead of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) [15:26:31] (03CR) 10Klausman: [C:03+1] Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 (owner: 10Elukey) [15:26:37] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:26:42] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:27:07] !log hnowlan@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/mobileapps: apply [15:27:35] (03CR) 10Elukey: [C:03+2] Revert "kserve-inference: remove the need for the kserve container's securityContext" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122974 (owner: 10Elukey) [15:27:39] (03CR) 10Vgutierrez: [C:03+2] prometheus::ops: Gather gobgpd metrics on liberica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1122959 (https://phabricator.wikimedia.org/T386687) (owner: 10Vgutierrez) [15:28:22] !log sgimeno@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [15:28:27] !log depooled maps2009 for server move T383709 [15:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:31] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [15:28:35] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester) [15:28:49] PROBLEM - ganeti-confd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:29:03] PROBLEM - ganeti-noded running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:29:38] !log sgimeno@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [15:29:46] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10583465 (10MoritzMuehlenhoff) @VRiley-WMF ganeti1044 is drained, you can move it around. 
[15:29:50] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-02-20-140756 to 2025-02-25-210518 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122963 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester) [15:30:07] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:30:36] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:31:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73699 and previous config saved to /var/cache/conftool/dbconfig/20250226-153111-root.json [15:31:36] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387257#10583481 (10Jhancock.wm) result of testing with luca. leaving open until March 7th to catch any other errors. disregard. [15:31:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73700 and previous config saved to /var/cache/conftool/dbconfig/20250226-153146-root.json [15:32:32] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:33:39] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [15:34:08] (03PS2) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) [15:34:19] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:34:21] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:34:37] (03PS1) 10PipelineBot: citoid: pipeline bot promote 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122976 [15:35:10] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:37:09] (03PS3) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) [15:39:17] (03PS8) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step - Copy hooks from readability model - Add edit-check under /experimental/values-ml-staging-codfw.yaml - Remove edit-check folder deploy it only under experimental - Add MODEL_NAME at edit-check custom_env [deployment-charts] - 10https://gerri [15:39:18] (https://phabricator.wikimedia.org/T386100) [15:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73701 and previous config saved to /var/cache/conftool/dbconfig/20250226-154616-root.json [15:46:36] (03PS7) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:46:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73702 and previous config saved to /var/cache/conftool/dbconfig/20250226-154651-root.json [15:47:36] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583538 (10cmooney) @fgiunchedi I wonder if you might have any ideas on this. Our routers and our switches are exporting timestamps with different number of digits: ` gn... 
[15:47:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583539 (10Jhancock.wm) @elukey try now. it got disabled on the nic. [15:48:20] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356 (10RobH) 03NEW [15:50:13] (03PS9) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. - Add newest image version - Try to fix the failing linting step - Copy hooks from readability model - Add edit-check under /experimental/values-ml-staging-codfw.yaml - Remove edit-check folder deploy it only under experimental - Add MODEL_NAME at edit-check custom_env - Create a swift bucket: s3://wmf-ml- [15:50:13] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [15:50:19] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:51:44] (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Fix typo in edit-check folder. 
- Add newest image version - Try to fix the f (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [15:53:08] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:55:22] (03CR) 10Bking: global_config: add external services for opensearch clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122900 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [15:55:26] (03CR) 10Mvolz: citoid: migrate group1 wikis to use rest-gateway instead of restbase (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122973 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [15:56:22] jouncebot: nowandnext [15:56:22] For the next 0 hour(s) and 3 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1500) [15:56:22] In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800) [15:56:30] Clear here. [15:56:58] thanks, James_F! 
[15:58:16] (03CR) 10Effie Mouzeli: [C:03+1] Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:58:20] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [15:58:23] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [15:58:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [15:58:47] (03CR) 10Bartosz Dziewoński: Deduplicate JsonConfig config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński) [15:59:17] (03PS11) 10Elukey: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:01:08] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10583616 (10xcollazo) @cmooney, should we move forward with this patch sometime soon? [16:01:15] (03CR) 10Elukey: "Two unresolved comments and then you are good to go!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73703 and previous config saved to /var/cache/conftool/dbconfig/20250226-160156-root.json [16:02:39] (03PS12) 10Herron: aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) [16:02:39] (03PS2) 10Ayounsi: Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) [16:03:19] !log cumin 'A:cp-text' 'disable-puppet "merging ATS Lua config change - T383845"' [16:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:24] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:03:44] (03CR) 10Herron: aux-k8s-ctrl codfw: apply role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:03:55] (03CR) 10Scott French: [C:03+2] "Thanks again! Moving ahead with this now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1122584 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [16:06:04] jouncebot nowandnext [16:06:04] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [16:06:04] In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800) [16:10:16] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1122562 (https://phabricator.wikimedia.org/T385947) (owner: 10Slyngshede) [16:13:28] (03CR) 10Elukey: [C:03+1] aux-k8s-ctrl codfw: apply role [puppet] - 10https://gerrit.wikimedia.org/r/1122170 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73704 and previous config saved to /var/cache/conftool/dbconfig/20250226-161701-root.json [16:17:21] (03CR) 10Kevin Bazira: [C:03+1] "besides a few minor nits in the commit message, the rest LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:17:42] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [16:18:12] (03CR) 10Thcipriani: [C:03+1] gerrit: remove explicit UseG1GC flag [puppet] - 10https://gerrit.wikimedia.org/r/1122899 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [16:19:57] !log dropping incorrectly created tables in new wikis (T352113) [16:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:00] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113 [16:21:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [16:24:07] !log brennen@deploy2002 Started deploy [phabricator/deployment@43155d4]: deploy phab2002 for T387172 [16:24:10] T387172: Deploy Phabricator/Phorge 2025-02-25 - https://phabricator.wikimedia.org/T387172 [16:24:35] !log brennen@deploy2002 Finished deploy [phabricator/deployment@43155d4]: deploy phab2002 for T387172 (duration: 00m 28s) [16:24:57] !log brennen@deploy2002 Started deploy [phabricator/deployment@43155d4]: deploy phab1004 for T387172 [16:25:46] !log brennen@deploy2002 Finished deploy [phabricator/deployment@43155d4]: deploy phab1004 for T387172 (duration: 00m 49s) [16:26:33] 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10583708 (10tappof) @wiki_willy The data is missing because Prometheus is not configured to retrieve metrics from magru's PDUs, as they are not present in NetBox. As soon as they are added... 
[16:34:14] (03CR) 10AikoChou: "The patch looks good to me, but the commit title is loooong lol" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:34:42] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10583733 (10cmooney) My robot friend suggested this which works to adjust the result of the promql to the right units: ` gnmi_bgp_neighbor_last_established{instance="$devi... [16:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3648 MB (3% inode=98%): /tmp 3648 MB (3% inode=98%): /var/tmp 3648 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:36:21] (03PS1) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) [16:37:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) (owner: 10Vgutierrez) [16:37:33] !log cumin -b8 -s90 'A:cp-text' 'run-puppet-agent -e "merging ATS Lua config change - T383845"' [16:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:37] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [16:38:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10583782 (10VRiley-WMF) 05Open→03In progress Proceeding with action [16:40:20] (03PS3) 10Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) [16:41:39] (03CR) 10CI reject: 
[V:04-1] Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: 10Arlolra) [16:42:10] PROBLEM - Host ganeti1044 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:06] (03PS2) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) [16:43:06] (03PS1) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) [16:43:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2075.codfw.wmnet with OS bullseye [16:43:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10583809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed: - ms-be2075 (**PASS**)... 
[16:44:08] (03CR) 10Filippo Giunchedi: [C:03+1] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [16:44:34] FIRING: ProbeDown: Service ganeti1044:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:38] (03PS4) 10Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) [16:44:48] PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:59] jouncebot: nowandnext [16:45:00] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [16:45:00] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800) [16:45:19] (03PS1) 10Itamar Givon: Remove unused route file from Wikibase REST API configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) [16:45:47] (03PS10) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Add newest image version - Add edit-check under /experimental/values-ml-staging-codfw.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [16:45:58] brennen: were you planning to deploy to mediawiki, or was the phab deploy the extent of it? [16:46:11] swfrench-wmf: just phab [16:46:21] awesome, thanks! 
[16:46:35] (03PS3) 10Vgutierrez: hiera,cephadm: Enable IPIP on apus@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122986 (https://phabricator.wikimedia.org/T387290) [16:46:35] (03PS2) 10Vgutierrez: hiera, cephadm: Enable IPIP on apus@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122989 (https://phabricator.wikimedia.org/T387290) [16:46:42] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:47:11] ^ expected, ganeti1044 is moved (but it's drained of VMs) [16:47:46] (03CR) 10Gkyziridis: inference-services: deployment for edit-check dummy model. - Add newest image version - Add edit-check under /experimental/values-ml-stagi (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:48:33] (03CR) 10Gkyziridis: [C:03+1] "Thnx for reviewing it folks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:49:25] FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:51:00] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host maps2009 [16:51:01] (03CR) 10Gkyziridis: [C:03+1] "I do not have the option of +2 here. I just did a +1, probably someone else needs to merge it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:51:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2009 [16:51:24] RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [16:53:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon) [16:53:16] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10583883 (10Jhancock.wm) [16:54:23] !log repooled maps2009 after completed server move T383709 [16:54:25] RESOLVED: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [16:54:28] RECOVERY - Host ganeti1044 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [16:56:42] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:57:22] RESOLVED: ProbeDown: Service ganeti1044:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:05] swfrench-wmf: That opportune time for a MediaWiki infrastructure (one-off) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1700). [17:01:14] o/ [17:02:00] (03CR) 10Scott French: "Thank you both for the reviews!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:02:02] (03PS1) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) [17:02:47] (03PS1) 10Hnowlan: switchdc: remove metal jobrunner references [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) [17:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:03:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [17:04:00] (03Merged) 10jenkins-bot: Re-enroll 5% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122655 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:04:27] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]] [17:04:32] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:04:32] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395 [17:04:43] (03PS1) 10PipelineBot: citoid: pipeline bot promote 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1122997 [17:05:44] (03PS3) 10Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) [17:05:48] (03CR) 10Kamila Součková: [C:03+1] "\o/" [cookbooks] - 10https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [17:07:16] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999 [17:07:28] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:27] !log swfrench@deploy2002 swfrench: Continuing with sync [17:09:12] (03PS2) 10Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) [17:09:35] 10ops-magru, 06DC-Ops, 10Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10583988 (10wiki_willy) Hi @tappof - thanks for looking into this. It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024: https://netbox.wikimedia.org/... [17:09:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: 10Vgutierrez) [17:10:35] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10583993 (10Jhancock.wm) @Scott_French would you be able to (or know who) could help me move conf2005 to clear u... 
[17:11:35] (CR) Scott French: [C:+1] switchdc: remove metal jobrunner references (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan) [17:12:50] (PS1) Vgutierrez: hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) [17:13:21] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: Vgutierrez) [17:15:00] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122655|Re-enroll 5% of client sessions in PHP 8.1 (T383845 T385395)]] (duration: 10m 33s) [17:15:05] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:15:06] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395 [17:16:07] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10584031 (Scott_French) @Jhancock.wm - Thanks for flagging! Yes, I can help you with that. I'll open a task sp... [17:16:37] ops-magru, DC-Ops, Observability-Metrics: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10584034 (tappof) Thank you, @wiki_willy, for pointing me in the right direction within NetBox. It seems the PuppetQL query might need to be updated (different model and/or type?). I'll t... [17:21:14] (PS6) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [17:23:13] (CR) Ollie Shotton: [C:+1] "Thanks!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: Itamar Givon) [17:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:26:09] SRE-swift-storage, Commons, MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10584096 (MatthewVernon) Looking harder at the timestamps around 13:12:28 just of the archive URL, going by the high-resolution timestamp order:... [17:26:19] (PS1) Vgutierrez: hiera,wmcs: Enable IPIP on labweb-ssl@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) [17:26:44] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [17:26:47] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [17:28:17] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: Vgutierrez) [17:29:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia) [17:33:35] (CR) AikoChou: inference-services: deployment for edit-check dummy model.
- Add newest image version - Add edit-check under /experimental/values-ml-stagi (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis) [17:33:41] (PS1) Ollie Shotton: Test new term store config in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) [17:36:19] (PS5) Simon04: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) [17:37:12] (CR) Simon04: "Done. Looking forward to your review." [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: Simon04) [17:38:29] (CR) CI reject: [V:-1] www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: Simon04) [17:41:30] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php sylwiki --fix # T387266 [17:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] T387266: MediaWiki\Page\PageAssertionException on https://syl.wikipedia.org/w/index.php?title=ꠃꠁꠇꠤꠙꠤꠒꠤꠀ:ꠀꠅꠇꠣ_ꠘꠄꠀꠁꠘ&diff=prev&oldid=9645 - https://phabricator.wikimedia.org/T387266 [17:44:06] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php sylwiki --fix --add-prefix T387266 [17:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:28] (CR) Ssingh: [C:+2] wikimedia-ech: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh) [17:45:46] (PS3) Ssingh: wikimedia-ech: add ncredir-parking [dns] - https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) [17:46:42] (CR) AikoChou: "You should have +2 option as
well (Tobias can fix this)." [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis) [17:47:49] ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10584273 (VRiley-WMF) In progress→Resolved a:VRiley-WMF ganeti1044 has been relocated to U30 in the same rack with the same co... [17:55:05] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1122995 (https://phabricator.wikimedia.org/T387291) (owner: Vgutierrez) [17:55:54] (CR) Andrea Denisse: "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: Vgutierrez) [17:56:03] (CR) Andrea Denisse: [C:+1] hiera,titan: Enable IPIP on thanos-(query|web)@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123000 (https://phabricator.wikimedia.org/T387291) (owner: Vgutierrez) [18:00:03] (PS2) Hnowlan: switchdc: remove metal jobrunner references [cookbooks] - https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) [18:00:05] swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1800).
[18:00:13] o/ [18:00:18] (CR) Hnowlan: switchdc: remove metal jobrunner references (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1122996 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan) [18:00:30] (PS4) Kamila Součková: benthos: add input/output config to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) [18:00:30] (PS1) Kamila Součková: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1123010 [18:00:45] I plan to use the second half hour of this window, so if anyone needs the first half hour for anything, please go ahead [18:01:37] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: java updates - bking@cumin2002 - T377938 [18:06:11] (PS11) Gkyziridis: inference-services: deployment for edit-check dummy model. [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) [18:07:26] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584344 (VRiley-WMF) [18:08:09] (CR) Subramanya Sastry: [C:+1] Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra) [18:13:06] (PS1) Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) [18:32:58] (CR) AikoChou: [C:+2] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis) [18:34:13] (Merged) jenkins-bot: inference-services: deployment for edit-check dummy model.
[deployment-charts] - https://gerrit.wikimedia.org/r/1122918 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis) [18:34:14] (CR) Ssingh: "recheck" [dns] - https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh) [18:35:12] (CR) Volans: Allow HTTPS connections from production to mgmt networks (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney) [18:36:10] !log sukhe@dns1004 START - running authdns-update [18:38:10] !log sukhe@dns1004 END - running authdns-update [18:39:44] (PS2) Scott French: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [18:40:41] (PS6) Simon04: www.wikipedia.org: fix "search" URL parameter [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) [18:42:31] alright, I'm back and will be making my planned changes shortly [18:48:55] (CR) Cathal Mooney: Allow HTTPS connections from production to mgmt networks (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney) [18:51:06] (PS2) Cathal Mooney: Allow HTTPS connections from production to mgmt networks [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) [18:52:17] (CR) Cathal Mooney: "Thanks for the feedback @volans.
I've changed this slightly, still adding a new term but not including the cumin_group as it's already al" [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney) [18:53:10] (CR) Cathal Mooney: Allow HTTPS connections from production to mgmt networks (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1123014 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney) [18:56:49] alas, my change will need to wait for now. I'll follow up later on today in an idle window. [19:00:06] dduvall and andre: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T1900). [19:02:50] (PS2) Scott French: Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli) [19:03:35] (CR) Scott French: "Rebased."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli) [19:08:15] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369) [19:08:16] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369) (owner: TrainBranchBot) [19:09:09] (Merged) jenkins-bot: group1 to 1.44.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123017 (https://phabricator.wikimedia.org/T382369) (owner: TrainBranchBot) [19:10:56] SRE, Traffic, Wikimedia-production-error: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584517 (Aklapper) [19:12:02] SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584521 (Aklapper) [19:18:34] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.18 refs T382369 [19:18:38] T382369: 1.44.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T382369 [19:21:04] SRE, Traffic: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10584529 (ssingh) @Grand-Duc: Hi, does this still persist for you? Or has it resolved? [19:24:04] (CR) Scott French: [C:+1] "Explicitly, this still LGTM, but it would be preferable to get a second pair of eyes on this following my edits." [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [19:24:18] (CR) Scott French: [C:+1] "Explicitly, this still LGTM, but it would be preferable to get a second pair of eyes on this following my edits."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli) [19:25:20] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: java updates - bking@cumin2002 - T377938 [19:25:59] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584533 (VRiley-WMF) Open→Resolved [19:26:09] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be105[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T385049#10584536 (VRiley-WMF) This is completed [19:34:46] (CR) Kamila Součková: [C:+1] mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [19:36:24] (CR) Kamila Součková: [C:+1] Re-enable cookie-based enrollment in 8.1 at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/1122585 (https://phabricator.wikimedia.org/T385395) (owner: Effie Mouzeli) [19:38:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra) [19:41:40] (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe) [19:44:06] dduvall: if I were to make a no-code-changes deployment (to shift a bit of traffic from PHP 7.4 to 8.1) during the latter half of your train window, would that be disruptive?
no worries at all if you'd prefer I don't [19:46:14] swfrench-wmf: I'm guessing if the train has moved, and is stable, it almost certainly won't be an issue (ie the latter half won't be used anyway) [19:46:35] swfrench-wmf: i'm looking into an error at the moment but the rate is very low, so feel free to go ahead [19:46:38] also, thanks for asking [19:48:27] dduvall: Reedy: great, thank you both. I'll move ahead with my change in a couple of minutes. [19:51:42] (PS1) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) [19:52:32] (PS2) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) [19:59:15] (PS3) Scott French: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [20:02:19] (CR) Scott French: [C:+2] "One last issue I just noticed before merging, but should be good to go now. Thank you both!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [20:03:35] (Merged) jenkins-bot: mw-(api-int|jobrunner|parsoid): resume php8.1 rollout [deployment-charts] - https://gerrit.wikimedia.org/r/1122587 (https://phabricator.wikimedia.org/T383845) (owner: Effie Mouzeli) [20:06:27] !log swfrench@deploy2002 Started scap sync-world: helmfile-only deployment to resume capacity-based 8.1 migrations - T383845 [20:06:32] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [20:08:36] !log swfrench@deploy2002 Finished scap sync-world: helmfile-only deployment to resume capacity-based 8.1 migrations - T383845 (duration: 03m 08s) [20:25:51] (CR) Bvibber: "Looks correct glancing over it, but I haven't tested the output arrays to confirm they're not missing anything yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński) [20:25:53] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [20:32:59] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2088.codfw.wmnet with OS bookworm [20:33:06] (PS1) BCornwall: cloud: update default acmechief_host host [puppet] - https://gerrit.wikimedia.org/r/1123028 [20:33:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:35:37] (CR) Ssingh: [C:+1] cloud: update default acmechief_host host [puppet] - https://gerrit.wikimedia.org/r/1123028 (owner: BCornwall) [20:35:42] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:43:36] (PS1) Gergő Tisza: [WIP] Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - https://gerrit.wikimedia.org/r/1123029
(https://phabricator.wikimedia.org/T363695) [20:44:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: java updates - bking@cumin2002 - T377938 [20:45:54] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:47:41] ACKNOWLEDGEMENT - MD RAID on ms-be2088 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387392 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:47:53] ops-codfw, SRE, DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387392 (ops-monitoring-bot) NEW [20:50:57] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2100). [21:00:05] kimberly_sarabia and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] Hi!
I'm here [21:00:41] o/ [21:03:04] o/ [21:03:07] i can deploy :) [21:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:03:52] (PS4) Kimberly Sarabia: Add config for donate banner to be enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) [21:04:52] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia) [21:05:36] (Merged) jenkins-bot: Add config for donate banner to be enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1122671 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia) [21:06:03] cjming: tysm!
[21:06:26] kimberly_sarabia: should be live :) [21:06:36] (PS5) Arlolra: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) [21:07:12] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra) [21:07:57] (Merged) jenkins-bot: Turn on Parsoid Read Views for 37 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1122712 (https://phabricator.wikimedia.org/T387254) (owner: Arlolra) [21:08:10] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [21:08:24] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]] [21:08:29] T387254: Parsoid Read Views to Wiktionary deploy ~2025-02-27 - https://phabricator.wikimedia.org/T387254 [21:09:25] FIRING: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1126:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1126 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:10:49] arlolra: on test servers [21:10:56] looking [21:11:20] cjming: ty!
[21:11:23] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.008e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:11:34] !log cjming@deploy2002 cjming, arlolra: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:11:34] kimberly_sarabia: yw! [21:12:00] cjming: lgtm [21:12:08] great - syncing [21:12:11] !log cjming@deploy2002 cjming, arlolra: Continuing with sync [21:15:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1126:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1126 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:18:35] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122712|Turn on Parsoid Read Views for 37 wiktionaries (T387254)]] (duration: 10m 10s) [21:18:39] T387254: Parsoid Read Views to Wiktionary deploy ~2025-02-27 - https://phabricator.wikimedia.org/T387254 [21:18:51] thanks cjming [21:19:04] yw! 
[21:19:34] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [21:23:47] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:38:14] (PS1) Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007) [21:39:02] !log disabling auto reboot for debian imaging temporarily [21:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:25] RESOLVED: SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:43] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [21:40:04] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [21:41:22] cjming: you are not deploying anymore, right?
I'd add one more patch [21:44:47] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza) [21:45:29] (Merged) jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 (attempt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123032 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza) [21:45:59] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]] [21:46:03] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007 [21:49:03] !log tgr@deploy2002 tgr: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:51:55] (PS1) Scott French: shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038) [21:53:10] (CR) Scott French: [C:+2] shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [21:54:23] (Merged) jenkins-bot: shellbox-media: revert to PHP 7.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1123033 (https://phabricator.wikimedia.org/T377038) (owner: Scott French) [21:56:22] tgr: sorry - yes!
[21:56:51] thanks, I figured it out from the logs eventually :) [21:57:21] !log tgr@deploy2002 tgr: Continuing with sync [21:58:25] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: java updates - bking@cumin2002 - T377938 [21:58:53] tgr|away: FYI, I'm going to be running a helmfile deployment for shellbox concurrent with your rollout. should not conflict in any way, but just wanted to flag it so it's not a surprise here. [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2200) [22:00:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [22:01:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [22:03:57] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123032|CentralAuth: Enable SUL3 signup on group 0 (attempt 2) (T384007)]] (duration: 17m 57s) [22:04:01] T384007: SUL3 Phase 1: All new account creation on group 0 wikis - https://phabricator.wikimedia.org/T384007 [22:04:56] !log UTC late deploys done [22:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [22:05:20] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [22:06:25] !log shellbox-media reverted to PHP 7.4 - T377038 [22:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:28] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [22:07:26] !log bking@cumin2002:~$ sudo apt-get install -y python3-opensearch T383811 [22:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:30] T383811: Ensure Search Platform-owned Elasticsearch 
cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811 [22:14:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm [22:14:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [22:14:22] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10584999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm [22:14:23] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585000 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm [22:24:04] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [22:27:17] (PS4) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:27:56] (PS5) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:28:21] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:30:35] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [22:34:19] (PS6) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:35:27] (CR) Bking: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:35:39] dzahn@cumin1002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. [22:39:46] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [22:40:38] (PS7) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:40:46] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:42:18] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release [22:43:29] (PS8) Bking: wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:44:52] dzahn@cumin1002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade.
[22:45:22] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper) [22:50:51] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bookworm [22:53:27] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [22:54:56] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250226T2300) [23:08:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm [23:08:13] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585249 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu... [23:08:23] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:12:15] (CR) Ladsgroup: "Thanks. I think we need this added for testcommonswiki too (unless we are planning to drop it once done)?"
[puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: Bvibber) [23:15:14] (CR) Ladsgroup: "Thanks I will try to get this deployed soon" [puppet] - https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: Simon04) [23:18:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm [23:18:54] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10585282 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm [23:21:20] (PS1) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) [23:22:25] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.008e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:22:38] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2088.codfw.wmnet with OS bookworm [23:26:37] (CR) Jdlrobson: [C:-1] Disable donate link in beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia) [23:28:26] (PS2) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) [23:31:55] (CR) Jdlrobson: [C:-1] Disable donate link in beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly
Sarabia) [23:36:16] (PS3) Kimberly Sarabia: Disable donate link in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) [23:41:14] (CR) Kimberly Sarabia: Disable donate link in beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123046 (https://phabricator.wikimedia.org/T386767) (owner: Kimberly Sarabia) [23:42:10] (PS1) Effie Mouzeli: WIP: introduce mw-experimental functionality [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [23:42:33] (CR) CI reject: [V:-1] WIP: introduce mw-experimental functionality [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [23:55:19] SRE-swift-storage, Commons, MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10585396 (Pppery) What have happened on the MediaWiki side is: ` 13:12, 25 February 2025 Sreejithk2000 talk contribs moved page File:Blason fam...